* allowing for a completely cached umount(2) pathwalk
@ 2023-04-13 22:00 Jeff Layton
2023-04-13 22:25 ` Andreas Dilger
` (3 more replies)
0 siblings, 4 replies; 46+ messages in thread
From: Jeff Layton @ 2023-04-13 22:00 UTC (permalink / raw)
To: Al Viro, Christian Brauner
Cc: Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells,
NeilBrown, Christoph Hellwig
David Wysochanski posted this a week ago:
https://lore.kernel.org/linux-nfs/CALF+zOnizN1KSE=V095LV6Mug8dJirqk7eN1joX8L1-EroohPA@mail.gmail.com/
It describes a situation where there are nested NFS mounts on a client,
and one of the intermediate mounts ends up being unexported from the
server. In a situation like this, we end up being unable to pathwalk
down to the child mount of these unreachable dentries and can't unmount
anything, even as root.
A decade ago, we did some work to make the kernel not revalidate the
leaf dentry on a umount [1]. This helped some similar sorts of problems
but it doesn't help if the problem is an intermediate dentry.
The idea at the time was that umount(2) is a special case: We are
specifically looking to stop using the mount, so there's nothing to be
gained by revalidating its root dentry and inode.
Based on the problem Dave describes, I'd submit that umount(2) is
special in another way too: It's intended to manipulate the mount table
of the local host, so contacting the backing store (the NFS server in
this case) during a pathwalk doesn't really help anything. All we care
about is getting to the right spot in the mount tree.
A "modest" proposal:
--------------------
This is still somewhat handwavy, but what if we were to make umount(2)
an even more special case for the pathwalk? For the umount(2) pathwalk,
we could:
1/ walk down the dentry tree without calling ->d_revalidate: We don't
care about changes that might have happened remotely. All we care about
is walking down the cached dentries as they are at that moment.
2/ disallow ->lookup operations: a umount is about removing an existing
mount, so the dentries had better already be there.
3/ skip inode ->permission checks. We don't want to check with the
server about our permission to walk the path when we're looking to
unmount. We're walking down the path on the _local_ machine so we can
unuse it. The server should have no say-so in the matter. (We probably
would want to require CAP_SYS_ADMIN or CAP_DAC_READ_SEARCH for this of
course).
We might need other safety checks too that I haven't considered yet.
Is this a terrible idea? Are there potentially problems with
containerized setups if we were to do something like this? Are there
better ways to solve this problem (and others like it)? Maybe this would
be best done with a new UMOUNT_CACHED flag for umount2()?
--
Jeff Layton <jlayton@kernel.org>
[1] 8033426e6bdb vfs: allow umount to handle mountpoints without
revalidating them
^ permalink raw reply [flat|nested] 46+ messages in thread* Re: allowing for a completely cached umount(2) pathwalk 2023-04-13 22:00 allowing for a completely cached umount(2) pathwalk Jeff Layton @ 2023-04-13 22:25 ` Andreas Dilger 2023-04-13 22:41 ` NeilBrown ` (2 subsequent siblings) 3 siblings, 0 replies; 46+ messages in thread From: Andreas Dilger @ 2023-04-13 22:25 UTC (permalink / raw) To: Jeff Layton Cc: Al Viro, Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, NeilBrown, Christoph Hellwig [-- Attachment #1: Type: text/plain, Size: 2954 bytes --] On Apr 13, 2023, at 4:00 PM, Jeff Layton <jlayton@kernel.org> wrote: > > David Wysochanski posted this a week ago: > > https://lore.kernel.org/linux-nfs/CALF+zOnizN1KSE=V095LV6Mug8dJirqk7eN1joX8L1-EroohPA@mail.gmail.com/ > > It describes a situation where there are nested NFS mounts on a client, > and one of the intermediate mounts ends up being unexported from the > server. In a situation like this, we end up being unable to pathwalk > down to the child mount of these unreachable dentries and can't unmount > anything, even as root. > > A decade ago, we did some work to make the kernel not revalidate the > leaf dentry on a umount [1]. This helped some similar sorts of problems > but it doesn't help if the problem is an intermediate dentry. > > The idea at the time was that umount(2) is a special case: We are > specifically looking to stop using the mount, so there's nothing to be > gained by revalidating its root dentry and inode. > > Based on the problem Dave describes, I'd submit that umount(2) is > special in another way too: It's intended to manipulate the mount table > of the local host, so contacting the backing store (the NFS server in > this case) during a pathwalk doesn't really help anything. All we care > about is getting to the right spot in the mount tree. > > A "modest" proposal: > -------------------- > This is still somewhat handwavy, but what if we were to make umount(2) > an even more special case for the pathwalk? For the umount(2) pathwalk, > we could: > > 1/ walk down the dentry tree without calling ->d_revalidate: We don't > care about changes that might have happened remotely. All we care about > is walking down the cached dentries as they are at that moment. > > 2/ disallow ->lookup operations: a umount is about removing an existing > mount, so the dentries had better already be there. > > 3/ skip inode ->permission checks. We don't want to check with the > server about our permission to walk the path when we're looking to > unmount. We're walking down the path on the _local_ machine so we can > unuse it. The server should have no say-so in the matter. (We probably > would want to require CAP_SYS_ADMIN or CAP_DAC_READ_SEARCH for this of > course). > > We might need other safety checks too that I haven't considered yet. > > Is this a terrible idea? Are there potentially problems with > containerized setups if we were to do something like this? Are there > better ways to solve this problem (and others like it)? Maybe this would > be best done with a new UMOUNT_CACHED flag for umount2()? > -- > [1] 8033426e6bdb vfs: allow umount to handle mountpoints without > revalidating them This would be great. This has been a problem when unmounting the filesystem ever since "umount" was changed to stat() the mountpoint by path before unmount, added in util-linux 2.19 (FC15, el7, SLES12). Cheers, Andreas [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 873 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-13 22:00 allowing for a completely cached umount(2) pathwalk Jeff Layton 2023-04-13 22:25 ` Andreas Dilger @ 2023-04-13 22:41 ` NeilBrown 2023-04-14 2:43 ` Al Viro 2023-04-14 2:32 ` Al Viro 2023-04-16 23:13 ` [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible NeilBrown 3 siblings, 1 reply; 46+ messages in thread From: NeilBrown @ 2023-04-13 22:41 UTC (permalink / raw) To: Jeff Layton Cc: Al Viro, Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, 14 Apr 2023, Jeff Layton wrote: > David Wysochanski posted this a week ago: > > https://lore.kernel.org/linux-nfs/CALF+zOnizN1KSE=V095LV6Mug8dJirqk7eN1joX8L1-EroohPA@mail.gmail.com/ > > It describes a situation where there are nested NFS mounts on a client, > and one of the intermediate mounts ends up being unexported from the > server. In a situation like this, we end up being unable to pathwalk > down to the child mount of these unreachable dentries and can't unmount > anything, even as root. > > A decade ago, we did some work to make the kernel not revalidate the > leaf dentry on a umount [1]. This helped some similar sorts of problems > but it doesn't help if the problem is an intermediate dentry. > > The idea at the time was that umount(2) is a special case: We are > specifically looking to stop using the mount, so there's nothing to be > gained by revalidating its root dentry and inode. > > Based on the problem Dave describes, I'd submit that umount(2) is > special in another way too: It's intended to manipulate the mount table > of the local host, so contacting the backing store (the NFS server in > this case) during a pathwalk doesn't really help anything. All we care > about is getting to the right spot in the mount tree. > > A "modest" proposal: I hope you didn't mean to reference https://en.wikipedia.org/wiki/A_Modest_Proposal ... > -------------------- > This is still somewhat handwavy, but what if we were to make umount(2) > an even more special case for the pathwalk? For the umount(2) pathwalk, > we could: > > 1/ walk down the dentry tree without calling ->d_revalidate: We don't > care about changes that might have happened remotely. All we care about > is walking down the cached dentries as they are at that moment. > > 2/ disallow ->lookup operations: a umount is about removing an existing > mount, so the dentries had better already be there. > > 3/ skip inode ->permission checks. We don't want to check with the > server about our permission to walk the path when we're looking to > unmount. We're walking down the path on the _local_ machine so we can > unuse it. The server should have no say-so in the matter. (We probably > would want to require CAP_SYS_ADMIN or CAP_DAC_READ_SEARCH for this of > course). > > We might need other safety checks too that I haven't considered yet. > > Is this a terrible idea? Are there potentially problems with > containerized setups if we were to do something like this? Are there > better ways to solve this problem (and others like it)? Maybe this would > be best done with a new UMOUNT_CACHED flag for umount2()? It might be a terrible idea, but it is essentially the same idea that I had, but hadn't got around to posting. The path name that appears in /proc/mounts is the key that must be used to find and unmount a filesystem. When you do that "find"ing you are not looking up a name in a filesystem, you are looking up a key in the mount table. We could, instead, create an api that is given a mount-id (first number in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could read /proc/self/mountinfo, find the mount-id, and unmount it - all without ever doing path name lookup in the traditional sense. But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed LOOKUP_CACHED, and it only finds paths that are in the dcache, never revalidates, at most performs simple permission checks based on cached content. NeilBrown ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-13 22:41 ` NeilBrown @ 2023-04-14 2:43 ` Al Viro 2023-04-14 3:28 ` Trond Myklebust 2023-04-14 10:06 ` Jeff Layton 0 siblings, 2 replies; 46+ messages in thread From: Al Viro @ 2023-04-14 2:43 UTC (permalink / raw) To: NeilBrown Cc: Jeff Layton, Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote: > The path name that appears in /proc/mounts is the key that must be used > to find and unmount a filesystem. When you do that "find"ing you are > not looking up a name in a filesystem, you are looking up a key in the > mount table. No. The path name in /proc/mounts is *NOT* a key - it's a best-effort attempt to describe the mountpoint. Pathname resolution does not work in terms of "the longest prefix is found and we handle the rest within that filesystem". > We could, instead, create an api that is given a mount-id (first number > in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could > read /proc/self/mountinfo, find the mount-id, and unmount it - all > without ever doing path name lookup in the traditional sense. > > But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed > LOOKUP_CACHED, and it only finds paths that are in the dcache, never > revalidates, at most performs simple permission checks based on cached > content. umount /proc/self/fd/42/barf/something Discuss. OTON, umount-by-mount-id is an interesting idea, but we'll need to decide what would the right permissions be for it. But please, lose the "mount table is a mapping from path prefix to filesystem" notion - it really, really is not. IIRC, there are systems that work that way, but it's nowhere near the semantics used by any Unices, all variants of Linux included. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 2:43 ` Al Viro @ 2023-04-14 3:28 ` Trond Myklebust 2023-04-14 3:51 ` Al Viro 2023-04-14 10:06 ` Jeff Layton 1 sibling, 1 reply; 46+ messages in thread From: Trond Myklebust @ 2023-04-14 3:28 UTC (permalink / raw) To: Alexander Viro Cc: Neil Brown, Jeffrey Layton, Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig > On Apr 13, 2023, at 22:43, Al Viro <viro@ZenIV.linux.org.uk> wrote: > > On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote: > >> The path name that appears in /proc/mounts is the key that must be used >> to find and unmount a filesystem. When you do that "find"ing you are >> not looking up a name in a filesystem, you are looking up a key in the >> mount table. > > No. The path name in /proc/mounts is *NOT* a key - it's a best-effort > attempt to describe the mountpoint. Pathname resolution does not work > in terms of "the longest prefix is found and we handle the rest within > that filesystem". > >> We could, instead, create an api that is given a mount-id (first number >> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could >> read /proc/self/mountinfo, find the mount-id, and unmount it - all >> without ever doing path name lookup in the traditional sense. >> >> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed >> LOOKUP_CACHED, and it only finds paths that are in the dcache, never >> revalidates, at most performs simple permission checks based on cached >> content. > > umount /proc/self/fd/42/barf/something > > Discuss. > > OTON, umount-by-mount-id is an interesting idea, but we'll need to decide > what would the right permissions be for it. > > But please, lose the "mount table is a mapping from path prefix to filesystem" > notion - it really, really is not. IIRC, there are systems that work that way, > but it's nowhere near the semantics used by any Unices, all variants of Linux > included. We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side? Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor. _________________________________ Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 3:28 ` Trond Myklebust @ 2023-04-14 3:51 ` Al Viro 2023-04-14 4:06 ` Trond Myklebust 2023-04-14 9:41 ` Christian Brauner 0 siblings, 2 replies; 46+ messages in thread From: Al Viro @ 2023-04-14 3:51 UTC (permalink / raw) To: Trond Myklebust Cc: Neil Brown, Jeffrey Layton, Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote: > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side? > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor. You can already do umount -l /proc/self/fd/69 if you have a descriptor. Converting mount-id to O_PATH... might be an interesting idea. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 3:51 ` Al Viro @ 2023-04-14 4:06 ` Trond Myklebust 2023-04-14 4:21 ` Al Viro 2023-04-14 9:41 ` Christian Brauner 1 sibling, 1 reply; 46+ messages in thread From: Trond Myklebust @ 2023-04-14 4:06 UTC (permalink / raw) To: Alexander Viro Cc: Neil Brown, Jeffrey Layton, Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig > On Apr 13, 2023, at 23:51, Al Viro <viro@ZenIV.linux.org.uk> wrote: > > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote: > >> We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side? >> Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor. > > You can already do umount -l /proc/self/fd/69 if you have a descriptor. > Converting mount-id to O_PATH... might be an interesting idea. A dedicated umountat() might avoid the need for the lazy flag, if it were allowed to close the descriptor on success for the special case of an empty path. Looking more closely, it would seem that CAP_DAC_READ_SEARCH might be a sufficient privilege requirement for the mount-id -> O_PATH syscall. _________________________________ Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 4:06 ` Trond Myklebust @ 2023-04-14 4:21 ` Al Viro 0 siblings, 0 replies; 46+ messages in thread From: Al Viro @ 2023-04-14 4:21 UTC (permalink / raw) To: Trond Myklebust Cc: Neil Brown, Jeffrey Layton, Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, Apr 14, 2023 at 04:06:03AM +0000, Trond Myklebust wrote: > > > > On Apr 13, 2023, at 23:51, Al Viro <viro@ZenIV.linux.org.uk> wrote: > > > > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote: > > > >> We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side? > >> Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor. > > > > You can already do umount -l /proc/self/fd/69 if you have a descriptor. > > Converting mount-id to O_PATH... might be an interesting idea. > > A dedicated umountat() might avoid the need for the lazy flag, if it were allowed to close the descriptor on success for the special case of an empty path. No. It's a wrong abstraction layer, anyway - "close the descriptor" != "make the opened file close", nevermind that it's a very odd corner case that will cause a lot of headache down the road. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 3:51 ` Al Viro 2023-04-14 4:06 ` Trond Myklebust @ 2023-04-14 9:41 ` Christian Brauner 2023-04-14 10:09 ` Jeff Layton 1 sibling, 1 reply; 46+ messages in thread From: Christian Brauner @ 2023-04-14 9:41 UTC (permalink / raw) To: Al Viro, Jeffrey Layton Cc: Trond Myklebust, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, Apr 14, 2023 at 04:51:04AM +0100, Al Viro wrote: > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote: > > > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side? > > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor. > > You can already do umount -l /proc/self/fd/69 if you have a descriptor. Way back when we put together stuff for [2] we had umountat() as an item but decided against it because it's mostely useful when used as AT_EMPTY_PATH. umount("/proc/self/fd/<nr>", ...) is useful when you don't trust the path and you need to resolve it with lookup restrictions. Then path resolution restrictions of openat2() can be used to get an fd. Which can be passed to umount(). I need to step outside so this is a halfway-out-the-door thought but given your description of the problem Jeff, why doesn't the following work (Just sketching this, you can't call openat2() like that.): fd_mnt = openat2("/my/funky/nfs/share/mount", RESOLVE_CACHED) umount("/proc/self/fd/fd_mnt", MNT_DETACH) > Converting mount-id to O_PATH... might be an interesting idea. I think using mount-ids would be nice and fwiw, something we considered as an alternative to umountat(). Not just can they be gotten from /proc/<pid>/mountinfo but we also do expose the mount id to userspace nowadays through: STATX_MNT_ID __u64 stx_mnt_id; which also came out of [2]. And it should be safe to do via AT_STATX_DONT_SYNC: statx(my_cached_fd, AT_NO_AUTOMOUNT|AT_SYMLINK_NOFOLLOW|AT_STATX_DONT_SYNC) using STATX_ATTR_MOUNT_ROOT to identify a potential mountpoint. Then pass that mount-id to the new system call. [2]: https://github.com/uapi-group/kernel-features ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 9:41 ` Christian Brauner @ 2023-04-14 10:09 ` Jeff Layton 2023-04-14 11:16 ` Christian Brauner 2023-04-15 9:51 ` Amir Goldstein 0 siblings, 2 replies; 46+ messages in thread From: Jeff Layton @ 2023-04-14 10:09 UTC (permalink / raw) To: Christian Brauner, Al Viro Cc: Trond Myklebust, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, 2023-04-14 at 11:41 +0200, Christian Brauner wrote: > On Fri, Apr 14, 2023 at 04:51:04AM +0100, Al Viro wrote: > > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote: > > > > > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side? > > > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor. > > > > You can already do umount -l /proc/self/fd/69 if you have a descriptor. > > Way back when we put together stuff for [2] we had umountat() as an item > but decided against it because it's mostely useful when used as AT_EMPTY_PATH. > > umount("/proc/self/fd/<nr>", ...) is useful when you don't trust the > path and you need to resolve it with lookup restrictions. Then path > resolution restrictions of openat2() can be used to get an fd. Which can > be passed to umount(). > > I need to step outside so this is a halfway-out-the-door thought but > given your description of the problem Jeff, why doesn't the following > work (Just sketching this, you can't call openat2() like that.): > > fd_mnt = openat2("/my/funky/nfs/share/mount", RESOLVE_CACHED) > umount("/proc/self/fd/fd_mnt", MNT_DETACH) Something like that might work. A RESOLVE_CACHED flag is something that would involve more than just umount(2) though. That said, it could be useful in other situations. > > > Converting mount-id to O_PATH... might be an interesting idea. > > I think using mount-ids would be nice and fwiw, something we considered > as an alternative to umountat(). Not just can they be gotten from > /proc/<pid>/mountinfo but we also do expose the mount id to userspace > nowadays through: > > STATX_MNT_ID > __u64 stx_mnt_id; > > which also came out of [2]. And it should be safe to do via > AT_STATX_DONT_SYNC: > > statx(my_cached_fd, AT_NO_AUTOMOUNT|AT_SYMLINK_NOFOLLOW|AT_STATX_DONT_SYNC) > > using STATX_ATTR_MOUNT_ROOT to identify a potential mountpoint. Then > pass that mount-id to the new system call. > > [2]: https://github.com/uapi-group/kernel-features This is generating a lot of good ideas! Maybe we should plan to discuss this further at LSF/MM? -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 10:09 ` Jeff Layton @ 2023-04-14 11:16 ` Christian Brauner 2023-04-14 12:33 ` Jeff Layton 2023-04-15 9:51 ` Amir Goldstein 1 sibling, 1 reply; 46+ messages in thread From: Christian Brauner @ 2023-04-14 11:16 UTC (permalink / raw) To: Jeff Layton Cc: Al Viro, Trond Myklebust, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, Apr 14, 2023 at 06:09:46AM -0400, Jeff Layton wrote: > On Fri, 2023-04-14 at 11:41 +0200, Christian Brauner wrote: > > On Fri, Apr 14, 2023 at 04:51:04AM +0100, Al Viro wrote: > > > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote: > > > > > > > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side? > > > > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor. > > > > > > You can already do umount -l /proc/self/fd/69 if you have a descriptor. > > > > Way back when we put together stuff for [2] we had umountat() as an item > > but decided against it because it's mostely useful when used as AT_EMPTY_PATH. > > > > umount("/proc/self/fd/<nr>", ...) is useful when you don't trust the > > path and you need to resolve it with lookup restrictions. Then path > > resolution restrictions of openat2() can be used to get an fd. Which can > > be passed to umount(). > > > > I need to step outside so this is a halfway-out-the-door thought but > > given your description of the problem Jeff, why doesn't the following > > work (Just sketching this, you can't call openat2() like that.): > > > > fd_mnt = openat2("/my/funky/nfs/share/mount", RESOLVE_CACHED) > > umount("/proc/self/fd/fd_mnt", MNT_DETACH) > > Something like that might work. A RESOLVE_CACHED flag is something that > would involve more than just umount(2) though. That said, it could be > useful in other situations. I think I introduced an ambiguity by accident. What I meant by "you can't call openat2() like that" is that it takes a struct open_how argument not just a simple flags argument. The flag I was talking about, RESOLVE_CACHED, does exist already. So it is already possible to use openat2() to resolve paths like that. See include/uapi/linux/openat2.h > > > > > > Converting mount-id to O_PATH... might be an interesting idea. > > > > I think using mount-ids would be nice and fwiw, something we considered > > as an alternative to umountat(). Not just can they be gotten from > > /proc/<pid>/mountinfo but we also do expose the mount id to userspace > > nowadays through: > > > > STATX_MNT_ID > > __u64 stx_mnt_id; > > > > which also came out of [2]. And it should be safe to do via > > AT_STATX_DONT_SYNC: > > > > statx(my_cached_fd, AT_NO_AUTOMOUNT|AT_SYMLINK_NOFOLLOW|AT_STATX_DONT_SYNC) > > > > using STATX_ATTR_MOUNT_ROOT to identify a potential mountpoint. Then > > pass that mount-id to the new system call. > > > > [2]: https://github.com/uapi-group/kernel-features > > This is generating a lot of good ideas! Maybe we should plan to discuss > this further at LSF/MM? Sure, happy to. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 11:16 ` Christian Brauner @ 2023-04-14 12:33 ` Jeff Layton 2023-04-14 12:51 ` Christian Brauner 0 siblings, 1 reply; 46+ messages in thread From: Jeff Layton @ 2023-04-14 12:33 UTC (permalink / raw) To: Christian Brauner, Al Viro Cc: Trond Myklebust, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, 2023-04-14 at 13:16 +0200, Christian Brauner wrote: > On Fri, Apr 14, 2023 at 06:09:46AM -0400, Jeff Layton wrote: > > On Fri, 2023-04-14 at 11:41 +0200, Christian Brauner wrote: > > > On Fri, Apr 14, 2023 at 04:51:04AM +0100, Al Viro wrote: > > > > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote: > > > > > > > > > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side? > > > > > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor. > > > > > > > > You can already do umount -l /proc/self/fd/69 if you have a descriptor. > > > > > > Way back when we put together stuff for [2] we had umountat() as an item > > > but decided against it because it's mostely useful when used as AT_EMPTY_PATH. > > > > > > umount("/proc/self/fd/<nr>", ...) is useful when you don't trust the > > > path and you need to resolve it with lookup restrictions. Then path > > > resolution restrictions of openat2() can be used to get an fd. Which can > > > be passed to umount(). > > > > > > I need to step outside so this is a halfway-out-the-door thought but > > > given your description of the problem Jeff, why doesn't the following > > > work (Just sketching this, you can't call openat2() like that.): > > > > > > fd_mnt = openat2("/my/funky/nfs/share/mount", RESOLVE_CACHED) > > > umount("/proc/self/fd/fd_mnt", MNT_DETACH) > > > > Something like that might work. A RESOLVE_CACHED flag is something that > > would involve more than just umount(2) though. That said, it could be > > useful in other situations. > > I think I introduced an ambiguity by accident. What I meant by "you > can't call openat2() like that" is that it takes a struct open_how > argument not just a simple flags argument. > > The flag I was talking about, RESOLVE_CACHED, does exist already. So it > is already possible to use openat2() to resolve paths like that. See > include/uapi/linux/openat2.h > Oh, thanks! I haven't tried this yet, but I doubt it'd fix the problem David was reproducing. There, the problem is mostly permission checks during the pathwalk. It tries to contact the server to check the permission, but because it's unexported, EACCES bubbles up. Also, to follow up: I tried Al's suggestion of lazily unmounting the bad intermediate mount closest to the root, and it does seem to clean up child mounts as expected. So, it does look like there is recourse in current kernels, even if the method is a bit unintuitive. Still, it might be worth discussing at LSF or here on the list. It'd be nice if umount "just worked" in this situation. Is there any reason to do a full-blown d_revalidating and permission-checking pathwalk on umount, assuming the caller already has CAP_SYS_ADMIN or similar? -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 12:33 ` Jeff Layton @ 2023-04-14 12:51 ` Christian Brauner 0 siblings, 0 replies; 46+ messages in thread From: Christian Brauner @ 2023-04-14 12:51 UTC (permalink / raw) To: Jeff Layton Cc: Al Viro, Trond Myklebust, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, Apr 14, 2023 at 08:33:09AM -0400, Jeff Layton wrote: > On Fri, 2023-04-14 at 13:16 +0200, Christian Brauner wrote: > > On Fri, Apr 14, 2023 at 06:09:46AM -0400, Jeff Layton wrote: > > > On Fri, 2023-04-14 at 11:41 +0200, Christian Brauner wrote: > > > > On Fri, Apr 14, 2023 at 04:51:04AM +0100, Al Viro wrote: > > > > > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote: > > > > > > > > > > > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side? > > > > > > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor. > > > > > > > > > > You can already do umount -l /proc/self/fd/69 if you have a descriptor. > > > > > > > > Way back when we put together stuff for [2] we had umountat() as an item > > > > but decided against it because it's mostely useful when used as AT_EMPTY_PATH. > > > > > > > > umount("/proc/self/fd/<nr>", ...) is useful when you don't trust the > > > > path and you need to resolve it with lookup restrictions. Then path > > > > resolution restrictions of openat2() can be used to get an fd. Which can > > > > be passed to umount(). > > > > > > > > I need to step outside so this is a halfway-out-the-door thought but > > > > given your description of the problem Jeff, why doesn't the following > > > > work (Just sketching this, you can't call openat2() like that.): > > > > > > > > fd_mnt = openat2("/my/funky/nfs/share/mount", RESOLVE_CACHED) > > > > umount("/proc/self/fd/fd_mnt", MNT_DETACH) > > > > > > Something like that might work. A RESOLVE_CACHED flag is something that > > > would involve more than just umount(2) though. That said, it could be > > > useful in other situations. > > > > I think I introduced an ambiguity by accident. What I meant by "you > > can't call openat2() like that" is that it takes a struct open_how > > argument not just a simple flags argument. > > > > The flag I was talking about, RESOLVE_CACHED, does exist already. So it > > is already possible to use openat2() to resolve paths like that. See > > include/uapi/linux/openat2.h > > > > Oh, thanks! I haven't tried this yet, but I doubt it'd fix the problem > David was reproducing. There, the problem is mostly permission checks > during the pathwalk. It tries to contact the server to check the > permission, but because it's unexported, EACCES bubbles up. > > Also, to follow up: I tried Al's suggestion of lazily unmounting the bad > intermediate mount closest to the root, and it does seem to clean up > child mounts as expected. So, it does look like there is recourse in > current kernels, even if the method is a bit unintuitive. See my other mail about this. Lazy umount is often the way to go. In fact, there'll be mount trees that you cannot unmount without MNT_DETACH because of mount propagation. > > Still, it might be worth discussing at LSF or here on the list. It'd be > nice if umount "just worked" in this situation. Is there any reason to > do a full-blown d_revalidating and permission-checking pathwalk on > umount, assuming the caller already has CAP_SYS_ADMIN or similar? I find this to be conceptually the wrong way of doing this. Because even if you can piggy-back on LOOKUP_CACHED you probably still need special cases handling for umount. And at that point I think a umount by mnt-id is just nicer. Because you would also need to expose the new umount() behavior under a new flag otherwise you risk breaking userspace assumptions. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 10:09 ` Jeff Layton 2023-04-14 11:16 ` Christian Brauner @ 2023-04-15 9:51 ` Amir Goldstein 1 sibling, 0 replies; 46+ messages in thread From: Amir Goldstein @ 2023-04-15 9:51 UTC (permalink / raw) To: Jeff Layton Cc: Christian Brauner, Al Viro, Trond Myklebust, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig, Josef Bacik On Fri, Apr 14, 2023 at 1:13 PM Jeff Layton <jlayton@kernel.org> wrote: > > On Fri, 2023-04-14 at 11:41 +0200, Christian Brauner wrote: > > On Fri, Apr 14, 2023 at 04:51:04AM +0100, Al Viro wrote: > > > On Fri, Apr 14, 2023 at 03:28:45AM +0000, Trond Myklebust wrote: > > > > > > > We already have support for directory file descriptors when mounting with move_mount(). Why not add a umountat() with similar support for the unmount side? > > > > Then add a syscall to allow users with (e.g.) the CAP_DAC_OVERRIDE privilege to convert the mount-id into an O_PATH file descriptor. > > > > > > You can already do umount -l /proc/self/fd/69 if you have a descriptor. > > > > Way back when we put together stuff for [2] we had umountat() as an item > > but decided against it because it's mostely useful when used as AT_EMPTY_PATH. > > > > umount("/proc/self/fd/<nr>", ...) is useful when you don't trust the > > path and you need to resolve it with lookup restrictions. Then path > > resolution restrictions of openat2() can be used to get an fd. Which can > > be passed to umount(). > > > > I need to step outside so this is a halfway-out-the-door thought but > > given your description of the problem Jeff, why doesn't the following > > work (Just sketching this, you can't call openat2() like that.): > > > > fd_mnt = openat2("/my/funky/nfs/share/mount", RESOLVE_CACHED) > > umount("/proc/self/fd/fd_mnt", MNT_DETACH) > > Something like that might work. A RESOLVE_CACHED flag is something that > would involve more than just umount(2) though. That said, it could be > useful in other situations. > > > > > > Converting mount-id to O_PATH... might be an interesting idea. > > > > I think using mount-ids would be nice and fwiw, something we considered > > as an alternative to umountat(). Not just can they be gotten from > > /proc/<pid>/mountinfo but we also do expose the mount id to userspace > > nowadays through: > > > > STATX_MNT_ID > > __u64 stx_mnt_id; > > > > which also came out of [2]. And it should be safe to do via > > AT_STATX_DONT_SYNC: > > > > statx(my_cached_fd, AT_NO_AUTOMOUNT|AT_SYMLINK_NOFOLLOW|AT_STATX_DONT_SYNC) > > > > using STATX_ATTR_MOUNT_ROOT to identify a potential mountpoint. Then > > pass that mount-id to the new system call. > > > > [2]: https://github.com/uapi-group/kernel-features > > This is generating a lot of good ideas! Maybe we should plan to discuss > this further at LSF/MM? > Hi Jeff, I am trying to collect the topics for LSF/MM FS sessions, but it is somewhat hard to do without an official [TOPIC] suggestion. Not sure if this specific thread has anything left to discuss in a session or if the original SUBJECT still describes the wider topic accurately. Could you please follow up with a [TOPIC] proposal or just let me know 1. That you are interested to lead the session 2. Descriptive title for the session to put in the schedule 3. lore link to put in the schedule While at it, please provide me with this info regarding "i_version improvements". Thanks, Amir. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 2:43 ` Al Viro 2023-04-14 3:28 ` Trond Myklebust @ 2023-04-14 10:06 ` Jeff Layton 2023-04-14 13:41 ` Christian Brauner 1 sibling, 1 reply; 46+ messages in thread From: Jeff Layton @ 2023-04-14 10:06 UTC (permalink / raw) To: Al Viro, NeilBrown Cc: Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote: > On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote: > > > The path name that appears in /proc/mounts is the key that must be used > > to find and unmount a filesystem. When you do that "find"ing you are > > not looking up a name in a filesystem, you are looking up a key in the > > mount table. > > No. The path name in /proc/mounts is *NOT* a key - it's a best-effort > attempt to describe the mountpoint. Pathname resolution does not work > in terms of "the longest prefix is found and we handle the rest within > that filesystem". > > > We could, instead, create an api that is given a mount-id (first number > > in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could > > read /proc/self/mountinfo, find the mount-id, and unmount it - all > > without ever doing path name lookup in the traditional sense. > > > > But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed > > LOOKUP_CACHED, and it only finds paths that are in the dcache, never > > revalidates, at most performs simple permission checks based on cached > > content. > > umount /proc/self/fd/42/barf/something > Does any of that involve talking to the server? I don't necessarily see a problem with doing the above. If "something" is in cache then that should still work. The main idea here is that we want to avoid communicating with the backing store during the umount(2) pathwalk. > Discuss. > > OTON, umount-by-mount-id is an interesting idea, but we'll need to decide > what would the right permissions be for it. > > But please, lose the "mount table is a mapping from path prefix to filesystem" > notion - it really, really is not. IIRC, there are systems that work that way, > but it's nowhere near the semantics used by any Unices, all variants of Linux > included. I'm not opposed to something by umount-by-mount-id either. All of this seems like something that should probably rely on CAP_SYS_ADMIN. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 10:06 ` Jeff Layton @ 2023-04-14 13:41 ` Christian Brauner 2023-04-14 14:21 ` Trond Myklebust 0 siblings, 1 reply; 46+ messages in thread From: Christian Brauner @ 2023-04-14 13:41 UTC (permalink / raw) To: Jeff Layton Cc: Al Viro, NeilBrown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote: > On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote: > > On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote: > > > > > The path name that appears in /proc/mounts is the key that must be used > > > to find and unmount a filesystem. When you do that "find"ing you are > > > not looking up a name in a filesystem, you are looking up a key in the > > > mount table. > > > > No. The path name in /proc/mounts is *NOT* a key - it's a best-effort > > attempt to describe the mountpoint. Pathname resolution does not work > > in terms of "the longest prefix is found and we handle the rest within > > that filesystem". > > > > > We could, instead, create an api that is given a mount-id (first number > > > in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could > > > read /proc/self/mountinfo, find the mount-id, and unmount it - all > > > without ever doing path name lookup in the traditional sense. > > > > > > But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed > > > LOOKUP_CACHED, and it only finds paths that are in the dcache, never > > > revalidates, at most performs simple permission checks based on cached > > > content. > > > > umount /proc/self/fd/42/barf/something > > > > Does any of that involve talking to the server? I don't necessarily see > a problem with doing the above. If "something" is in cache then that > should still work. > > The main idea here is that we want to avoid communicating with the > backing store during the umount(2) pathwalk. > > > Discuss. > > > > OTON, umount-by-mount-id is an interesting idea, but we'll need to decide > > what would the right permissions be for it. > > > > But please, lose the "mount table is a mapping from path prefix to filesystem" > > notion - it really, really is not. IIRC, there are systems that work that way, > > but it's nowhere near the semantics used by any Unices, all variants of Linux > > included. > > I'm not opposed to something by umount-by-mount-id either. All of this > seems like something that should probably rely on CAP_SYS_ADMIN. The permission model needs to account for the fact that mount ids are global and as such you could in principle unmount any mount in any mount namespace. IOW, you can circumvent lookup restrictions completely. So we could resolve the mnt-id to an FMODE_PATH and then very roughly with no claim to solving everything: may_umount_by_mnt_id(struct path *opath) { struct path root; bool reachable; // caller in principle able to circumvent lookup restrictions if (!may_cap_dac_readsearch()) return false; // caller can mount in their mountns if (!may_mount()) return false; // target mount and caller in the same mountns if (!check_mnt()) return false; // caller could in principle reach mount from it's root get_fs_root(current->fs, &root); reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root); path_put(&root); return reachable; } However, that still means that we have laxer restrictions on unmounting by mount-id then on unmount with lookup as for lookup just having CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems without custom permission handlers - we also establish that the inode can be mapped into the caller's idmapping. So that would mean that unmounting by mount-id would allow you to unmount mounts in cases where you wouldn't with umount. That might be fine though as that's ultimately the goal here in a way. One could also see a very useful feature in this where you require capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow unmounting any mount in the system by mount-id. This would obviously be very useful for privileged service managers but I haven't thought this through. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 13:41 ` Christian Brauner @ 2023-04-14 14:21 ` Trond Myklebust 2023-04-14 15:13 ` Christian Brauner 0 siblings, 1 reply; 46+ messages in thread From: Trond Myklebust @ 2023-04-14 14:21 UTC (permalink / raw) To: Christian Brauner Cc: Jeffrey Layton, Alexander Viro, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig > On Apr 14, 2023, at 09:41, Christian Brauner <brauner@kernel.org> wrote: > > On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote: >> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote: >>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote: >>> >>>> The path name that appears in /proc/mounts is the key that must be used >>>> to find and unmount a filesystem. When you do that "find"ing you are >>>> not looking up a name in a filesystem, you are looking up a key in the >>>> mount table. >>> >>> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort >>> attempt to describe the mountpoint. Pathname resolution does not work >>> in terms of "the longest prefix is found and we handle the rest within >>> that filesystem". >>> >>>> We could, instead, create an api that is given a mount-id (first number >>>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could >>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all >>>> without ever doing path name lookup in the traditional sense. >>>> >>>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed >>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never >>>> revalidates, at most performs simple permission checks based on cached >>>> content. >>> >>> umount /proc/self/fd/42/barf/something >>> >> >> Does any of that involve talking to the server? I don't necessarily see >> a problem with doing the above. If "something" is in cache then that >> should still work. >> >> The main idea here is that we want to avoid communicating with the >> backing store during the umount(2) pathwalk. >> >>> Discuss. >>> >>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide >>> what would the right permissions be for it. >>> >>> But please, lose the "mount table is a mapping from path prefix to filesystem" >>> notion - it really, really is not. IIRC, there are systems that work that way, >>> but it's nowhere near the semantics used by any Unices, all variants of Linux >>> included. >> >> I'm not opposed to something by umount-by-mount-id either. All of this >> seems like something that should probably rely on CAP_SYS_ADMIN. > > The permission model needs to account for the fact that mount ids are > global and as such you could in principle unmount any mount in any mount > namespace. IOW, you can circumvent lookup restrictions completely. > > So we could resolve the mnt-id to an FMODE_PATH and then very roughly > with no claim to solving everything: > > may_umount_by_mnt_id(struct path *opath) > { > struct path root; > bool reachable; > > // caller in principle able to circumvent lookup restrictions > if (!may_cap_dac_readsearch()) > return false; > > // caller can mount in their mountns > if (!may_mount()) > return false; > > // target mount and caller in the same mountns > if (!check_mnt()) > return false; > > // caller could in principle reach mount from it's root > get_fs_root(current->fs, &root); > reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root); > path_put(&root); > > return reachable; > } > > However, that still means that we have laxer restrictions on unmounting > by mount-id then on unmount with lookup as for lookup just having > CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems > without custom permission handlers - we also establish that the inode > can be mapped into the caller's idmapping. > > So that would mean that unmounting by mount-id would allow you to > unmount mounts in cases where you wouldn't with umount. That might be > fine though as that's ultimately the goal here in a way. > > One could also see a very useful feature in this where you require > capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow > unmounting any mount in the system by mount-id. This would obviously be > very useful for privileged service managers but I haven't thought this > Through. That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege. The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege. Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN. _________________________________ Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 14:21 ` Trond Myklebust @ 2023-04-14 15:13 ` Christian Brauner 2023-04-14 15:30 ` Trond Myklebust 0 siblings, 1 reply; 46+ messages in thread From: Christian Brauner @ 2023-04-14 15:13 UTC (permalink / raw) To: Trond Myklebust Cc: Jeffrey Layton, Alexander Viro, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, Apr 14, 2023 at 02:21:00PM +0000, Trond Myklebust wrote: > > > > On Apr 14, 2023, at 09:41, Christian Brauner <brauner@kernel.org> wrote: > > > > On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote: > >> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote: > >>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote: > >>> > >>>> The path name that appears in /proc/mounts is the key that must be used > >>>> to find and unmount a filesystem. When you do that "find"ing you are > >>>> not looking up a name in a filesystem, you are looking up a key in the > >>>> mount table. > >>> > >>> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort > >>> attempt to describe the mountpoint. Pathname resolution does not work > >>> in terms of "the longest prefix is found and we handle the rest within > >>> that filesystem". > >>> > >>>> We could, instead, create an api that is given a mount-id (first number > >>>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could > >>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all > >>>> without ever doing path name lookup in the traditional sense. > >>>> > >>>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed > >>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never > >>>> revalidates, at most performs simple permission checks based on cached > >>>> content. > >>> > >>> umount /proc/self/fd/42/barf/something > >>> > >> > >> Does any of that involve talking to the server? I don't necessarily see > >> a problem with doing the above. If "something" is in cache then that > >> should still work. > >> > >> The main idea here is that we want to avoid communicating with the > >> backing store during the umount(2) pathwalk. > >> > >>> Discuss. > >>> > >>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide > >>> what would the right permissions be for it. > >>> > >>> But please, lose the "mount table is a mapping from path prefix to filesystem" > >>> notion - it really, really is not. IIRC, there are systems that work that way, > >>> but it's nowhere near the semantics used by any Unices, all variants of Linux > >>> included. > >> > >> I'm not opposed to something by umount-by-mount-id either. All of this > >> seems like something that should probably rely on CAP_SYS_ADMIN. > > > > The permission model needs to account for the fact that mount ids are > > global and as such you could in principle unmount any mount in any mount > > namespace. IOW, you can circumvent lookup restrictions completely. > > > > So we could resolve the mnt-id to an FMODE_PATH and then very roughly > > with no claim to solving everything: > > > > may_umount_by_mnt_id(struct path *opath) > > { > > struct path root; > > bool reachable; > > > > // caller in principle able to circumvent lookup restrictions > > if (!may_cap_dac_readsearch()) > > return false; > > > > // caller can mount in their mountns > > if (!may_mount()) > > return false; > > > > // target mount and caller in the same mountns > > if (!check_mnt()) > > return false; > > > > // caller could in principle reach mount from it's root > > get_fs_root(current->fs, &root); > > reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root); > > path_put(&root); > > > > return reachable; > > } > > > > However, that still means that we have laxer restrictions on unmounting > > by mount-id then on unmount with lookup as for lookup just having > > CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems > > without custom permission handlers - we also establish that the inode > > can be mapped into the caller's idmapping. > > > > So that would mean that unmounting by mount-id would allow you to > > unmount mounts in cases where you wouldn't with umount. That might be > > fine though as that's ultimately the goal here in a way. > > > > One could also see a very useful feature in this where you require > > capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow > > unmounting any mount in the system by mount-id. This would obviously be > > very useful for privileged service managers but I haven't thought this > > Through. > > That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege. > > The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege. > > Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN. There's a difference between unmounting directly by providing a mount id and getting an O_PATH file descriptor from a mnt-id. If you can simply unmount by mount-id it's useful for users that have CAP_DAC_READ_SEARCH in a container. Without it you likely need to require capable(CAP_DAC_READ_SEARCH) aka system level privileges just like open_to_handle_at() which makes this interface way less generic and usable. Otherwise you'd be able to get an O_PATH fd to something that you wouldn't be able to access through normal path lookup. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 15:13 ` Christian Brauner @ 2023-04-14 15:30 ` Trond Myklebust 2023-04-14 15:57 ` Trond Myklebust 2023-04-14 16:32 ` Christian Brauner 0 siblings, 2 replies; 46+ messages in thread From: Trond Myklebust @ 2023-04-14 15:30 UTC (permalink / raw) To: Christian Brauner Cc: Jeffrey Layton, Alexander Viro, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig > On Apr 14, 2023, at 11:13, Christian Brauner <brauner@kernel.org> wrote: > > On Fri, Apr 14, 2023 at 02:21:00PM +0000, Trond Myklebust wrote: >> >> >>> On Apr 14, 2023, at 09:41, Christian Brauner <brauner@kernel.org> wrote: >>> >>> On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote: >>>> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote: >>>>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote: >>>>> >>>>>> The path name that appears in /proc/mounts is the key that must be used >>>>>> to find and unmount a filesystem. When you do that "find"ing you are >>>>>> not looking up a name in a filesystem, you are looking up a key in the >>>>>> mount table. >>>>> >>>>> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort >>>>> attempt to describe the mountpoint. Pathname resolution does not work >>>>> in terms of "the longest prefix is found and we handle the rest within >>>>> that filesystem". >>>>> >>>>>> We could, instead, create an api that is given a mount-id (first number >>>>>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could >>>>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all >>>>>> without ever doing path name lookup in the traditional sense. >>>>>> >>>>>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed >>>>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never >>>>>> revalidates, at most performs simple permission checks based on cached >>>>>> content. >>>>> >>>>> umount /proc/self/fd/42/barf/something >>>>> >>>> >>>> Does any of that involve talking to the server? I don't necessarily see >>>> a problem with doing the above. If "something" is in cache then that >>>> should still work. >>>> >>>> The main idea here is that we want to avoid communicating with the >>>> backing store during the umount(2) pathwalk. >>>> >>>>> Discuss. >>>>> >>>>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide >>>>> what would the right permissions be for it. >>>>> >>>>> But please, lose the "mount table is a mapping from path prefix to filesystem" >>>>> notion - it really, really is not. IIRC, there are systems that work that way, >>>>> but it's nowhere near the semantics used by any Unices, all variants of Linux >>>>> included. >>>> >>>> I'm not opposed to something by umount-by-mount-id either. All of this >>>> seems like something that should probably rely on CAP_SYS_ADMIN. >>> >>> The permission model needs to account for the fact that mount ids are >>> global and as such you could in principle unmount any mount in any mount >>> namespace. IOW, you can circumvent lookup restrictions completely. >>> >>> So we could resolve the mnt-id to an FMODE_PATH and then very roughly >>> with no claim to solving everything: >>> >>> may_umount_by_mnt_id(struct path *opath) >>> { >>> struct path root; >>> bool reachable; >>> >>> // caller in principle able to circumvent lookup restrictions >>> if (!may_cap_dac_readsearch()) >>> return false; >>> >>> // caller can mount in their mountns >>> if (!may_mount()) >>> return false; >>> >>> // target mount and caller in the same mountns >>> if (!check_mnt()) >>> return false; >>> >>> // caller could in principle reach mount from it's root >>> get_fs_root(current->fs, &root); >>> reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root); >>> path_put(&root); >>> >>> return reachable; >>> } >>> >>> However, that still means that we have laxer restrictions on unmounting >>> by mount-id then on unmount with lookup as for lookup just having >>> CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems >>> without custom permission handlers - we also establish that the inode >>> can be mapped into the caller's idmapping. >>> >>> So that would mean that unmounting by mount-id would allow you to >>> unmount mounts in cases where you wouldn't with umount. That might be >>> fine though as that's ultimately the goal here in a way. >>> >>> One could also see a very useful feature in this where you require >>> capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow >>> unmounting any mount in the system by mount-id. This would obviously be >>> very useful for privileged service managers but I haven't thought this >>> Through. >> >> That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege. >> >> The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege. >> >> Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN. > > There's a difference between unmounting directly by providing a mount id > and getting an O_PATH file descriptor from a mnt-id. If you can simply > unmount by mount-id it's useful for users that have CAP_DAC_READ_SEARCH > in a container. Without it you likely need to require > capable(CAP_DAC_READ_SEARCH) aka system level privileges just like > open_to_handle_at() which makes this interface way less generic and > usable. Otherwise you'd be able to get an O_PATH fd to something that > you wouldn't be able to access through normal path lookup. Being able to convert into an O_PATH descriptor gives you more options than just unmounting. It should allow you to syncfs() before unmounting. It should allow you to call open_tree() so you can manipulate the filesystem that is no longer accessible by path walk (e.g. so you can bind it elsewhere or move it). _________________________________ Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 15:30 ` Trond Myklebust @ 2023-04-14 15:57 ` Trond Myklebust 2023-04-14 16:22 ` Al Viro 2023-04-14 16:32 ` Christian Brauner 1 sibling, 1 reply; 46+ messages in thread From: Trond Myklebust @ 2023-04-14 15:57 UTC (permalink / raw) To: Christian Brauner Cc: Jeffrey Layton, Alexander Viro, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig > On Apr 14, 2023, at 11:30, Trond Myklebust <trondmy@hammerspace.com> wrote: > > > >> On Apr 14, 2023, at 11:13, Christian Brauner <brauner@kernel.org> wrote: >> >> On Fri, Apr 14, 2023 at 02:21:00PM +0000, Trond Myklebust wrote: >>> >>> >>>> On Apr 14, 2023, at 09:41, Christian Brauner <brauner@kernel.org> wrote: >>>> >>>> On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote: >>>>> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote: >>>>>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote: >>>>>> >>>>>>> The path name that appears in /proc/mounts is the key that must be used >>>>>>> to find and unmount a filesystem. When you do that "find"ing you are >>>>>>> not looking up a name in a filesystem, you are looking up a key in the >>>>>>> mount table. >>>>>> >>>>>> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort >>>>>> attempt to describe the mountpoint. Pathname resolution does not work >>>>>> in terms of "the longest prefix is found and we handle the rest within >>>>>> that filesystem". >>>>>> >>>>>>> We could, instead, create an api that is given a mount-id (first number >>>>>>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could >>>>>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all >>>>>>> without ever doing path name lookup in the traditional sense. >>>>>>> >>>>>>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed >>>>>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never >>>>>>> revalidates, at most performs simple permission checks based on cached >>>>>>> content. >>>>>> >>>>>> umount /proc/self/fd/42/barf/something >>>>>> >>>>> >>>>> Does any of that involve talking to the server? I don't necessarily see >>>>> a problem with doing the above. If "something" is in cache then that >>>>> should still work. >>>>> >>>>> The main idea here is that we want to avoid communicating with the >>>>> backing store during the umount(2) pathwalk. >>>>> >>>>>> Discuss. >>>>>> >>>>>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide >>>>>> what would the right permissions be for it. >>>>>> >>>>>> But please, lose the "mount table is a mapping from path prefix to filesystem" >>>>>> notion - it really, really is not. IIRC, there are systems that work that way, >>>>>> but it's nowhere near the semantics used by any Unices, all variants of Linux >>>>>> included. >>>>> >>>>> I'm not opposed to something by umount-by-mount-id either. All of this >>>>> seems like something that should probably rely on CAP_SYS_ADMIN. >>>> >>>> The permission model needs to account for the fact that mount ids are >>>> global and as such you could in principle unmount any mount in any mount >>>> namespace. IOW, you can circumvent lookup restrictions completely. >>>> >>>> So we could resolve the mnt-id to an FMODE_PATH and then very roughly >>>> with no claim to solving everything: >>>> >>>> may_umount_by_mnt_id(struct path *opath) >>>> { >>>> struct path root; >>>> bool reachable; >>>> >>>> // caller in principle able to circumvent lookup restrictions >>>> if (!may_cap_dac_readsearch()) >>>> return false; >>>> >>>> // caller can mount in their mountns >>>> if (!may_mount()) >>>> return false; >>>> >>>> // target mount and caller in the same mountns >>>> if (!check_mnt()) >>>> return false; >>>> >>>> // caller could in principle reach mount from it's root >>>> get_fs_root(current->fs, &root); >>>> reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root); >>>> path_put(&root); >>>> >>>> return reachable; >>>> } >>>> >>>> However, that still means that we have laxer restrictions on unmounting >>>> by mount-id then on unmount with lookup as for lookup just having >>>> CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems >>>> without custom permission handlers - we also establish that the inode >>>> can be mapped into the caller's idmapping. >>>> >>>> So that would mean that unmounting by mount-id would allow you to >>>> unmount mounts in cases where you wouldn't with umount. That might be >>>> fine though as that's ultimately the goal here in a way. >>>> >>>> One could also see a very useful feature in this where you require >>>> capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow >>>> unmounting any mount in the system by mount-id. This would obviously be >>>> very useful for privileged service managers but I haven't thought this >>>> Through. >>> >>> That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege. >>> >>> The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege. >>> >>> Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN. >> >> There's a difference between unmounting directly by providing a mount id >> and getting an O_PATH file descriptor from a mnt-id. If you can simply >> unmount by mount-id it's useful for users that have CAP_DAC_READ_SEARCH >> in a container. Without it you likely need to require >> capable(CAP_DAC_READ_SEARCH) aka system level privileges just like >> open_to_handle_at() which makes this interface way less generic and >> usable. Otherwise you'd be able to get an O_PATH fd to something that >> you wouldn't be able to access through normal path lookup. > > > Being able to convert into an O_PATH descriptor gives you more options than just unmounting. It should allow you to syncfs() before unmounting. It should allow you to call open_tree() so you can manipulate the filesystem that is no longer accessible by path walk (e.g. so you can bind it elsewhere or move it). > One more thing it might allow us to do, which I’ve been wanting for a while in NFS: allow us to flip the mount type from being “hard” to “soft” before doing the lazy unmount, so that any application that might still retry I/O after the call to umount_begin() completes will start timing out with an I/O error, and free up the resources it might otherwise hold forever. _________________________________ Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 15:57 ` Trond Myklebust @ 2023-04-14 16:22 ` Al Viro 2023-04-14 16:41 ` Trond Myklebust 0 siblings, 1 reply; 46+ messages in thread From: Al Viro @ 2023-04-14 16:22 UTC (permalink / raw) To: Trond Myklebust Cc: Christian Brauner, Jeffrey Layton, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, Apr 14, 2023 at 03:57:34PM +0000, Trond Myklebust wrote: > > > > Being able to convert into an O_PATH descriptor gives you more options than just unmounting. It should allow you to syncfs() before unmounting. It should allow you to call open_tree() so you can manipulate the filesystem that is no longer accessible by path walk (e.g. so you can bind it elsewhere or move it). > > > > One more thing it might allow us to do, which I’ve been wanting for a while in NFS: allow us to flip the mount type from being “hard” to “soft” before doing the lazy unmount, so that any application that might still retry I/O after the call to umount_begin() completes will start timing out with an I/O error, and free up the resources it might otherwise hold forever. > s/lazy/forced/, surely? Confused... Note, BTW, that hard vs. soft is a property of fs instance; if you have it present elsewhere in the mount tree, flipping it would affect all such places. I don't see any good way to make it a per-mount thing, TBH... ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 16:22 ` Al Viro @ 2023-04-14 16:41 ` Trond Myklebust 2023-04-14 19:01 ` Benjamin Coddington 0 siblings, 1 reply; 46+ messages in thread From: Trond Myklebust @ 2023-04-14 16:41 UTC (permalink / raw) To: Alexander Viro Cc: Christian Brauner, Jeffrey Layton, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig > On Apr 14, 2023, at 12:22, Al Viro <viro@ZenIV.linux.org.uk> wrote: > > On Fri, Apr 14, 2023 at 03:57:34PM +0000, Trond Myklebust wrote: > >>> >>> Being able to convert into an O_PATH descriptor gives you more options than just unmounting. It should allow you to syncfs() before unmounting. It should allow you to call open_tree() so you can manipulate the filesystem that is no longer accessible by path walk (e.g. so you can bind it elsewhere or move it). >>> >> >> One more thing it might allow us to do, which I’ve been wanting for a while in NFS: allow us to flip the mount type from being “hard” to “soft” before doing the lazy unmount, so that any application that might still retry I/O after the call to umount_begin() completes will start timing out with an I/O error, and free up the resources it might otherwise hold forever. >> > > s/lazy/forced/, surely? Confused... I mean both cases. Doing a lazy umount with a hard mounted filesystem is a risk sport: if the server does become permanently borked, you can fill up your page cache with stuff that can’t be evicted. Most users don’t realise this, so they get confused when it happens (particularly since the filesystem is out-of-sight and hence out-of-mind). > > Note, BTW, that hard vs. soft is a property of fs instance; if you have > it present elsewhere in the mount tree, flipping it would affect all > such places. I don't see any good way to make it a per-mount thing, TBH… The main use case is for when the server is permanently down, so normally it shouldn’t be a problem with flipping the mode on all instances. That said, it might be nice to make it per-mountpoint at some time. We do have the ability to declare individual RPC calls to time out, so it’s doable at the RPC level. All we would really need is the ability to store a per-vfsmount flag. _________________________________ Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@hammerspace.com ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 16:41 ` Trond Myklebust @ 2023-04-14 19:01 ` Benjamin Coddington 2023-04-17 8:22 ` Christian Brauner 0 siblings, 1 reply; 46+ messages in thread From: Benjamin Coddington @ 2023-04-14 19:01 UTC (permalink / raw) To: Trond Myklebust Cc: Alexander Viro, Christian Brauner, Jeffrey Layton, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On 14 Apr 2023, at 12:41, Trond Myklebust wrote: > > I mean both cases. Doing a lazy umount with a hard mounted filesystem is a risk sport: if the server does become permanently borked, you can fill up your page cache with stuff that can’t be evicted. Most users don’t realise this, so they get confused when it happens (particularly since the filesystem is out-of-sight and hence out-of-mind). I've been pecking away at a sysfs knob for this case. Seemed a clearer path to destruction. >> >> Note, BTW, that hard vs. soft is a property of fs instance; if you have >> it present elsewhere in the mount tree, flipping it would affect all >> such places. I don't see any good way to make it a per-mount thing, TBH… > > > The main use case is for when the server is permanently down, so normally it shouldn’t be a problem with flipping the mode on all instances. Is there another case? Because, if so.. > That said, it might be nice to make it per-mountpoint at some time. We do have the ability to declare individual RPC calls to time out, so it’s doable at the RPC level. All we would really need is the ability to store a per-vfsmount flag. .. maybe vfsmount's mnt_root d_fsdata? Ben ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 19:01 ` Benjamin Coddington @ 2023-04-17 8:22 ` Christian Brauner 0 siblings, 0 replies; 46+ messages in thread From: Christian Brauner @ 2023-04-17 8:22 UTC (permalink / raw) To: Benjamin Coddington Cc: Trond Myklebust, Alexander Viro, Jeffrey Layton, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, Apr 14, 2023 at 03:01:01PM -0400, Benjamin Coddington wrote: > On 14 Apr 2023, at 12:41, Trond Myklebust wrote: > > > > I mean both cases. Doing a lazy umount with a hard mounted filesystem is a risk sport: if the server does become permanently borked, you can fill up your page cache with stuff that can’t be evicted. Most users don’t realise this, so they get confused when it happens (particularly since the filesystem is out-of-sight and hence out-of-mind). > > I've been pecking away at a sysfs knob for this case. Seemed a clearer path to destruction. > > >> > >> Note, BTW, that hard vs. soft is a property of fs instance; if you have > >> it present elsewhere in the mount tree, flipping it would affect all > >> such places. I don't see any good way to make it a per-mount thing, TBH… > > > > > > The main use case is for when the server is permanently down, so normally it shouldn’t be a problem with flipping the mode on all instances. > > Is there another case? Because, if so.. > > > That said, it might be nice to make it per-mountpoint at some time. > > We do have the ability to declare individual RPC calls to time out, > > so it’s doable at the RPC level. All we would really need is the > > ability to store a per-vfsmount flag. I would very much like to avoid having filesystem specific data in struct vfsmount. That sounds like a maintenance nightmare going forward. Mount structures should remain pure vfs constructs that only carry generic properties imho. > > .. maybe vfsmount's mnt_root d_fsdata? I don't think that would work without having access to the vfsmount. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 15:30 ` Trond Myklebust 2023-04-14 15:57 ` Trond Myklebust @ 2023-04-14 16:32 ` Christian Brauner 1 sibling, 0 replies; 46+ messages in thread From: Christian Brauner @ 2023-04-14 16:32 UTC (permalink / raw) To: Trond Myklebust Cc: Jeffrey Layton, Alexander Viro, Neil Brown, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, Apr 14, 2023 at 03:30:30PM +0000, Trond Myklebust wrote: > > > > On Apr 14, 2023, at 11:13, Christian Brauner <brauner@kernel.org> wrote: > > > > On Fri, Apr 14, 2023 at 02:21:00PM +0000, Trond Myklebust wrote: > >> > >> > >>> On Apr 14, 2023, at 09:41, Christian Brauner <brauner@kernel.org> wrote: > >>> > >>> On Fri, Apr 14, 2023 at 06:06:38AM -0400, Jeff Layton wrote: > >>>> On Fri, 2023-04-14 at 03:43 +0100, Al Viro wrote: > >>>>> On Fri, Apr 14, 2023 at 08:41:03AM +1000, NeilBrown wrote: > >>>>> > >>>>>> The path name that appears in /proc/mounts is the key that must be used > >>>>>> to find and unmount a filesystem. When you do that "find"ing you are > >>>>>> not looking up a name in a filesystem, you are looking up a key in the > >>>>>> mount table. > >>>>> > >>>>> No. The path name in /proc/mounts is *NOT* a key - it's a best-effort > >>>>> attempt to describe the mountpoint. Pathname resolution does not work > >>>>> in terms of "the longest prefix is found and we handle the rest within > >>>>> that filesystem". > >>>>> > >>>>>> We could, instead, create an api that is given a mount-id (first number > >>>>>> in /proc/self/mountinfo) and unmounts that. Then /sbin/umount could > >>>>>> read /proc/self/mountinfo, find the mount-id, and unmount it - all > >>>>>> without ever doing path name lookup in the traditional sense. > >>>>>> > >>>>>> But I prefer your suggestion. LOOKUP_MOUNTPOINT could be renamed > >>>>>> LOOKUP_CACHED, and it only finds paths that are in the dcache, never > >>>>>> revalidates, at most performs simple permission checks based on cached > >>>>>> content. > >>>>> > >>>>> umount /proc/self/fd/42/barf/something > >>>>> > >>>> > >>>> Does any of that involve talking to the server? I don't necessarily see > >>>> a problem with doing the above. If "something" is in cache then that > >>>> should still work. > >>>> > >>>> The main idea here is that we want to avoid communicating with the > >>>> backing store during the umount(2) pathwalk. > >>>> > >>>>> Discuss. > >>>>> > >>>>> OTON, umount-by-mount-id is an interesting idea, but we'll need to decide > >>>>> what would the right permissions be for it. > >>>>> > >>>>> But please, lose the "mount table is a mapping from path prefix to filesystem" > >>>>> notion - it really, really is not. IIRC, there are systems that work that way, > >>>>> but it's nowhere near the semantics used by any Unices, all variants of Linux > >>>>> included. > >>>> > >>>> I'm not opposed to something by umount-by-mount-id either. All of this > >>>> seems like something that should probably rely on CAP_SYS_ADMIN. > >>> > >>> The permission model needs to account for the fact that mount ids are > >>> global and as such you could in principle unmount any mount in any mount > >>> namespace. IOW, you can circumvent lookup restrictions completely. > >>> > >>> So we could resolve the mnt-id to an FMODE_PATH and then very roughly > >>> with no claim to solving everything: > >>> > >>> may_umount_by_mnt_id(struct path *opath) > >>> { > >>> struct path root; > >>> bool reachable; > >>> > >>> // caller in principle able to circumvent lookup restrictions > >>> if (!may_cap_dac_readsearch()) > >>> return false; > >>> > >>> // caller can mount in their mountns > >>> if (!may_mount()) > >>> return false; > >>> > >>> // target mount and caller in the same mountns > >>> if (!check_mnt()) > >>> return false; > >>> > >>> // caller could in principle reach mount from it's root > >>> get_fs_root(current->fs, &root); > >>> reachable = is_path_reachable(real_mount(opath->mnt), opath->dentry, &root); > >>> path_put(&root); > >>> > >>> return reachable; > >>> } > >>> > >>> However, that still means that we have laxer restrictions on unmounting > >>> by mount-id then on unmount with lookup as for lookup just having > >>> CAP_DAC_READ_SEARCH isn't enough. Usually - at least for filesytems > >>> without custom permission handlers - we also establish that the inode > >>> can be mapped into the caller's idmapping. > >>> > >>> So that would mean that unmounting by mount-id would allow you to > >>> unmount mounts in cases where you wouldn't with umount. That might be > >>> fine though as that's ultimately the goal here in a way. > >>> > >>> One could also see a very useful feature in this where you require > >>> capable(CAP_DAC_READ_SEARCH) and capable(CAP_SYS_ADMIN) and then allow > >>> unmounting any mount in the system by mount-id. This would obviously be > >>> very useful for privileged service managers but I haven't thought this > >>> Through. > >> > >> That is exactly why having a separate syscall to do the lookup of the mount-id is good: it provides separation of privilege. > >> > >> The conversion of mount-id to an O_PATH file descriptor is just akin to a path lookup, so only needs CAP_DAC_READ_SEARCH (since you require privilege only to bypass the ACL directory read and lookup restrictions). The resulting O_PATH file descriptor has no special properties that require any further privilege. > >> > >> Then use that resulting file descriptor for the umount, which normally requires CAP_SYS_ADMIN. > > > > There's a difference between unmounting directly by providing a mount id > > and getting an O_PATH file descriptor from a mnt-id. If you can simply > > unmount by mount-id it's useful for users that have CAP_DAC_READ_SEARCH > > in a container. Without it you likely need to require > > capable(CAP_DAC_READ_SEARCH) aka system level privileges just like > > open_to_handle_at() which makes this interface way less generic and > > usable. Otherwise you'd be able to get an O_PATH fd to something that > > you wouldn't be able to access through normal path lookup. > > > Being able to convert into an O_PATH descriptor gives you more options > than just unmounting. It should allow you to syncfs() before > unmounting. It should allow you to call open_tree() so you can > manipulate the filesystem that is no longer accessible by path walk > (e.g. so you can bind it elsewhere or move it). I'm not saying it's wrong. I'm just saying there are trade-offs. Sure this is useful but it'll need to be a pretty privileged api which might be fine. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-13 22:00 allowing for a completely cached umount(2) pathwalk Jeff Layton 2023-04-13 22:25 ` Andreas Dilger 2023-04-13 22:41 ` NeilBrown @ 2023-04-14 2:32 ` Al Viro 2023-04-14 10:01 ` Jeff Layton 2023-04-14 13:16 ` David Wysochanski 2023-04-16 23:13 ` [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible NeilBrown 3 siblings, 2 replies; 46+ messages in thread From: Al Viro @ 2023-04-14 2:32 UTC (permalink / raw) To: Jeff Layton Cc: Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, NeilBrown, Christoph Hellwig On Thu, Apr 13, 2023 at 06:00:42PM -0400, Jeff Layton wrote: > It describes a situation where there are nested NFS mounts on a client, > and one of the intermediate mounts ends up being unexported from the > server. In a situation like this, we end up being unable to pathwalk > down to the child mount of these unreachable dentries and can't unmount > anything, even as root. So umount -l the stuck sucker. What's the problem with that? > 2/ disallow ->lookup operations: a umount is about removing an existing > mount, so the dentries had better already be there. That changes the semantics; as it is, you need exec permissions on the entire path... > Is this a terrible idea? Are there potentially problems with > containerized setups if we were to do something like this? Are there > better ways to solve this problem (and others like it)? Maybe this would > be best done with a new UMOUNT_CACHED flag for umount2()? We already have lazy umount. And what will you do to symlinks you run into along the way? They *are* traversed; requiring the caller to canonicalize them will only shift the problem to userland... ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 2:32 ` Al Viro @ 2023-04-14 10:01 ` Jeff Layton 2023-04-14 12:18 ` Christian Brauner 2023-04-14 14:57 ` Al Viro 2023-04-14 13:16 ` David Wysochanski 1 sibling, 2 replies; 46+ messages in thread From: Jeff Layton @ 2023-04-14 10:01 UTC (permalink / raw) To: Al Viro Cc: Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, NeilBrown, Christoph Hellwig On Fri, 2023-04-14 at 03:32 +0100, Al Viro wrote: > On Thu, Apr 13, 2023 at 06:00:42PM -0400, Jeff Layton wrote: > > > It describes a situation where there are nested NFS mounts on a client, > > and one of the intermediate mounts ends up being unexported from the > > server. In a situation like this, we end up being unable to pathwalk > > down to the child mount of these unreachable dentries and can't unmount > > anything, even as root. > > So umount -l the stuck sucker. What's the problem with that? > You mean lazy umount the parent that is stuck? What happens to the child mount in that case? Is it also eventually cleaned up? > > 2/ disallow ->lookup operations: a umount is about removing an existing > > mount, so the dentries had better already be there. > > That changes the semantics; as it is, you need exec permissions on the > entire path... > Yep. But, I think it makes some sense to do so in the context of a umount. Mostly, umount is done by a privileged user anyway so avoiding permission checks isn't too great a stretch, IMO. > > Is this a terrible idea? Are there potentially problems with > > containerized setups if we were to do something like this? Are there > > better ways to solve this problem (and others like it)? Maybe this would > > be best done with a new UMOUNT_CACHED flag for umount2()? > > We already have lazy umount. And what will you do to symlinks you run > into along the way? They *are* traversed; requiring the caller to > canonicalize them will only shift the problem to userland... Yeah, I hadn't considered symlinks here. Still, if we have a cached symlink dentry, wouldn't we also already have the symlink target in cache too? Or is that not guaranteed? -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 10:01 ` Jeff Layton @ 2023-04-14 12:18 ` Christian Brauner 2023-04-14 14:57 ` Al Viro 1 sibling, 0 replies; 46+ messages in thread From: Christian Brauner @ 2023-04-14 12:18 UTC (permalink / raw) To: Jeff Layton Cc: Al Viro, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, NeilBrown, Christoph Hellwig On Fri, Apr 14, 2023 at 06:01:59AM -0400, Jeff Layton wrote: > On Fri, 2023-04-14 at 03:32 +0100, Al Viro wrote: > > On Thu, Apr 13, 2023 at 06:00:42PM -0400, Jeff Layton wrote: > > > > > It describes a situation where there are nested NFS mounts on a client, > > > and one of the intermediate mounts ends up being unexported from the > > > server. In a situation like this, we end up being unable to pathwalk > > > down to the child mount of these unreachable dentries and can't unmount > > > anything, even as root. > > > > So umount -l the stuck sucker. What's the problem with that? > > > > You mean lazy umount the parent that is stuck? What happens to the child > mount in that case? Is it also eventually cleaned up? I hope it's ok I barge in to answer this but due to the mount beneath patches I was working on I did spend even more time in that code then I already did. So this is good chance to get yelled at if I analyzed these codepaths wrong. The child mount would be unmounted in that case. umount_tree() is what you want to be looking at. If you perform a regular umount _without_ MNT_DETACH you can see that umount_tree() is effectively guarded by a call to propagate_mount_busy(). It checks wether the direct umount target has any child mounts and if so refuses the umount with EBUSY: mkdir -p /mnt/a/b /mnt/c /mnt/d # Create parent mount of a@c mount --bind /mnt/a /mnt/c # create child d@b which as child mount of a@c mount --bind /mnt/d /mnt/c/b If you call umount /mnt/c it will fail because a@c has child mounts. If you do a lazy umount via MNT_DETACH through umount -l /mnt/c then it will also unmount all children of a@c. In fact it will even include children of children... mkdir /mnt/c/b/e mount --bind /mnt/a/b/ /mnt/c/b/e umount -l /mnt/c That's basically what the next_mnt() loop at the beginning of umount_tree() is doing where it collects all direct targets to umount. However, if mount propagation is in play things get a lot nastier as you can fail a non-MNT_DETACH umount because of it as well (Note that umount propagation is always triggered if the parent mount of your direct umount target is a shared mount. IOW, you can't easily opt out of it unless you make the parent mount of your immediate umount target a non-shared mount.). A trivial reason that comes to mind where you would fail the umount due to mount propagation would where a propagated mount is kept busy and not the original mount. So similar to above on the host do: mkdir -p /mnt/a/b /mnt/c /mnt/d mount --bind /mnt/a /mnt/c umount /mnt/c and you would expect the umount /mnt/c to work. But you realize it fails with EBUSY but noone is referencing that mount anymore at least not in an obvious way. But assume someone had a mount namespace open that receives mount propagation from /. In that case the a@c mount would have propagated into that mount namespace. So someone could've cd /mnt/c into that propagated mount and the umount /mnt/c would fail. In that case propagate_mount_busy() would detect the increased refcount when it tries to check whether the umount could be propagated and give you EBUSY. So here you also need a lazy umount to get rid of that mount... And there are other nice scenarios where that's hard to figure out. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 10:01 ` Jeff Layton 2023-04-14 12:18 ` Christian Brauner @ 2023-04-14 14:57 ` Al Viro 1 sibling, 0 replies; 46+ messages in thread From: Al Viro @ 2023-04-14 14:57 UTC (permalink / raw) To: Jeff Layton Cc: Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, NeilBrown, Christoph Hellwig On Fri, Apr 14, 2023 at 06:01:59AM -0400, Jeff Layton wrote: > On Fri, 2023-04-14 at 03:32 +0100, Al Viro wrote: > > On Thu, Apr 13, 2023 at 06:00:42PM -0400, Jeff Layton wrote: > > > > > It describes a situation where there are nested NFS mounts on a client, > > > and one of the intermediate mounts ends up being unexported from the > > > server. In a situation like this, we end up being unable to pathwalk > > > down to the child mount of these unreachable dentries and can't unmount > > > anything, even as root. > > > > So umount -l the stuck sucker. What's the problem with that? > > > > You mean lazy umount the parent that is stuck? What happens to the child > mount in that case? Is it also eventually cleaned up? Yes. Lazy umount takes out the contributions to refcount due to being present in mount tree; as soon as other references go away (opened files, current directories, in-progress syscalls on files in there) struct mount instance releases the active reference to filesystem instance (struct super_block) and goes away. Once all active references to filesystem instance are gone, it gets shut down. > Yeah, I hadn't considered symlinks here. Still, if we have a cached > symlink dentry, wouldn't we also already have the symlink target in > cache too? Or is that not guaranteed? Not at all. What's more, why would we have that symlink dentry cached in the first place? If you suddenly impose that kind of restrictions on umount(2), you are pretty much guaranteed to break userland... Symlink dentries are rarely pinned, BTW - they may very well be cached if we hit them often enough, but outside of the things like lchown()/lstat() or pathname resolution in progress that is currently traversing that symlink... refcount is going to be zero. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: allowing for a completely cached umount(2) pathwalk 2023-04-14 2:32 ` Al Viro 2023-04-14 10:01 ` Jeff Layton @ 2023-04-14 13:16 ` David Wysochanski 1 sibling, 0 replies; 46+ messages in thread From: David Wysochanski @ 2023-04-14 13:16 UTC (permalink / raw) To: Al Viro Cc: Jeff Layton, Christian Brauner, linux-fsdevel, linux-nfs, David Howells, NeilBrown, Christoph Hellwig On Thu, Apr 13, 2023 at 10:34 PM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Thu, Apr 13, 2023 at 06:00:42PM -0400, Jeff Layton wrote: > > > It describes a situation where there are nested NFS mounts on a client, > > and one of the intermediate mounts ends up being unexported from the > > server. In a situation like this, we end up being unable to pathwalk > > down to the child mount of these unreachable dentries and can't unmount > > anything, even as root. > > So umount -l the stuck sucker. What's the problem with that? > Speaking in the context of the original reproducer: This does work if you "umount -l /A" - the underlying "/A" mount, but not if you do "umount -l /A/B" (both have lost their access on the server). The problem with this is IMO, it is not very intuitive to lazy unmount the root of a whole tree like that. Also, the customer's case was autofs, and I don't think autofs does this, it tries to umount the children first and gets stuck there in this scenario. But overall yes it does make sense, IMO. > > 2/ disallow ->lookup operations: a umount is about removing an existing > > mount, so the dentries had better already be there. > > That changes the semantics; as it is, you need exec permissions on the > entire path... > > > Is this a terrible idea? Are there potentially problems with > > containerized setups if we were to do something like this? Are there > > better ways to solve this problem (and others like it)? Maybe this would > > be best done with a new UMOUNT_CACHED flag for umount2()? > > We already have lazy umount. And what will you do to symlinks you run > into along the way? They *are* traversed; requiring the caller to > canonicalize them will only shift the problem to userland... > ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-13 22:00 allowing for a completely cached umount(2) pathwalk Jeff Layton ` (2 preceding siblings ...) 2023-04-14 2:32 ` Al Viro @ 2023-04-16 23:13 ` NeilBrown 2023-04-17 11:55 ` Christian Brauner 2023-04-17 12:09 ` Jeff Layton 3 siblings, 2 replies; 46+ messages in thread From: NeilBrown @ 2023-04-16 23:13 UTC (permalink / raw) To: Jeff Layton Cc: Al Viro, Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig When performing a LOOKUP_MOUNTPOINT lookup we don't really want to engage with underlying systems at all. Any mount point MUST be in the dcache with a complete direct path from the root to the mountpoint. That should be sufficient to find the mountpoint given a path name. This becomes an issue when the filesystem changes unexpected, such as when a NFS server is changed to reject all access. It then becomes impossible to unmount anything mounted on the filesystem which has changed. We could simply lazy-unmount the changed filesystem and that will often be sufficient. However if the target filesystem needs "umount -f" to complete the unmount properly, then the lazy unmount will leave it incompletely unmounted. When "-f" is needed, we really need to be able to name the target filesystem. We NEVER want to revalidate anything. We already avoid the revalidation of the mountpoint itself, be we won't need to revalidate anything on the path either as thay might affect the cache, and the cache is what we are really looking in. Permission checks are a little less clear. We currently allow any user to make the umount syscall and perform the path lookup and only reject unprivileged users once the target mount point has been found. If we completely relax permission checks then an unprivileged user could probe inaccessible parts of the name space by examining the error returned from umount(). So we only relax permission check when the user has CAP_SYS_ADMIN (may_mount() succeeds). Note that if the path given is not direct and for example uses symlinks or "..", then dentries or symlink content may not be cached and a remote server could cause problem. We can only be certain of a safe unmount if a direct path is used. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 40 insertions(+), 6 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index edfedfbccaef..f2df1adae7c5 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) * * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. */ -int inode_permission(struct mnt_idmap *idmap, - struct inode *inode, int mask) +int inode_permission_mp(struct mnt_idmap *idmap, + struct inode *inode, int mask, bool mp) { int retval; @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap, return -EACCES; } - retval = do_inode_permission(idmap, inode, mask); + if (mp) + /* We are looking for a mountpoint and so don't + * want to interact with the filesystem at all, just + * the dcache and icache. + */ + retval = generic_permission(idmap, inode, mask); + else + retval = do_inode_permission(idmap, inode, mask); if (retval) return retval; @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap, return security_inode_permission(inode, mask); } + +/** + * inode_permission - Check for access rights to a given inode + * @idmap: idmap of the mount the inode was found from + * @inode: Inode to check permission on + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) + * + * Check for read/write/execute permissions on an inode. We use fs[ug]id for + * this, letting us set arbitrary permissions for filesystem access without + * changing the "normal" UIDs which are used for other things. + * + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. + */ +int inode_permission(struct mnt_idmap *idmap, + struct inode *inode, int mask) +{ + return inode_permission_mp(idmap, inode, mask, false); +} EXPORT_SYMBOL(inode_permission); /** @@ -589,6 +614,7 @@ struct nameidata { #define ND_ROOT_PRESET 1 #define ND_ROOT_GRABBED 2 #define ND_JUMPED 4 +#define ND_SYS_ADMIN 8 static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name) { @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry) static inline int d_revalidate(struct dentry *dentry, unsigned int flags) { - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) && + likely(!(flags & LOOKUP_MOUNTPOINT))) return dentry->d_op->d_revalidate(dentry, flags); else return 1; @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name, static inline int may_lookup(struct mnt_idmap *idmap, struct nameidata *nd) { + /* If we are looking for a mountpoint and we have the SYS_ADMIN + * capability, then we will by-pass the filesys for permission checks + * and just use generic_permission(). + */ + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN); if (nd->flags & LOOKUP_RCU) { - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK); + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp); if (err != -ECHILD || !try_to_unlazy(nd)) return err; } - return inode_permission(idmap, nd->inode, MAY_EXEC); + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp); } static int reserve_stack(struct nameidata *nd, struct path *link) @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags, if (IS_ERR(name)) return PTR_ERR(name); set_nameidata(&nd, dfd, name, root); + if ((flags & LOOKUP_MOUNTPOINT) && may_mount()) + nd.state = ND_SYS_ADMIN; retval = path_lookupat(&nd, flags | LOOKUP_RCU, path); if (unlikely(retval == -ECHILD)) retval = path_lookupat(&nd, flags, path); -- 2.40.0 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-16 23:13 ` [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible NeilBrown @ 2023-04-17 11:55 ` Christian Brauner 2023-04-17 12:25 ` Jeff Layton 2023-04-17 21:26 ` NeilBrown 2023-04-17 12:09 ` Jeff Layton 1 sibling, 2 replies; 46+ messages in thread From: Christian Brauner @ 2023-04-17 11:55 UTC (permalink / raw) To: NeilBrown Cc: Jeff Layton, Al Viro, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote: > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to > engage with underlying systems at all. Any mount point MUST be in the > dcache with a complete direct path from the root to the mountpoint. > That should be sufficient to find the mountpoint given a path name. > > This becomes an issue when the filesystem changes unexpected, such as > when a NFS server is changed to reject all access. It then becomes > impossible to unmount anything mounted on the filesystem which has > changed. We could simply lazy-unmount the changed filesystem and that > will often be sufficient. However if the target filesystem needs > "umount -f" to complete the unmount properly, then the lazy unmount will > leave it incompletely unmounted. When "-f" is needed, we really need to I don't understand this yet. All I see is nfs_umount_begin() that's different for MNT_FORCE to kill remaining io. Why does that preclude MNT_DETACH? You might very well fail MNT_FORCE and the only way you can get rid is to use MNT_DETACH, no? So I don't see why that is an argument. > be able to name the target filesystem. > > We NEVER want to revalidate anything. We already avoid the revalidation > of the mountpoint itself, be we won't need to revalidate anything on the > path either as thay might affect the cache, and the cache is what we are > really looking in. > > Permission checks are a little less clear. We currently allow any user This is all very brittle. First case that comes to mind is overlayfs where the permission checking is performed twice. Once on the overlayfs inode itself based on the caller's security context and a second time on the underlying inode with the security context of the mounter of the overlayfs instance. A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so they'd be able to mount the overlayfs instance but would be restricted from accessing certain directories or files. The task accessing the overlayfs instance however could have a completely different security context including CAP_DAC_READ_SEARCH and such. Both tasks could reasonably be in different user namespaces and so on. The LSM hooks are also called twice and would now also be called once. It also forgets that acl_permission() check may very well call into the filesystem via check_acl(). So umount could either be used to infer existence of files that the user wouldn't otherwise know they exist or in the worst case allow to umount something that they wouldn't have access to. Aside from that this would break userspace assumptions and as Al and I've mentioned before in the other thread you'd need a new flag to umount2() for this. The permission model can't just change behind users back. But I dislike it for the now even more special-cased umount path lookup alone tbh. I'd feel way more comfortable with a non-lookup related solution that doesn't have subtle implications for permission checking. > to make the umount syscall and perform the path lookup and only reject > unprivileged users once the target mount point has been found. If we > completely relax permission checks then an unprivileged user could probe > inaccessible parts of the name space by examining the error returned > from umount(). > > So we only relax permission check when the user has CAP_SYS_ADMIN > (may_mount() succeeds). > > Note that if the path given is not direct and for example uses symlinks > or "..", then dentries or symlink content may not be cached and a remote > server could cause problem. We can only be certain of a safe unmount if > a direct path is used. > > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 40 insertions(+), 6 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index edfedfbccaef..f2df1adae7c5 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) > * > * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > */ > -int inode_permission(struct mnt_idmap *idmap, > - struct inode *inode, int mask) > +int inode_permission_mp(struct mnt_idmap *idmap, > + struct inode *inode, int mask, bool mp) > { > int retval; > > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap, > return -EACCES; > } > > - retval = do_inode_permission(idmap, inode, mask); > + if (mp) > + /* We are looking for a mountpoint and so don't > + * want to interact with the filesystem at all, just > + * the dcache and icache. > + */ > + retval = generic_permission(idmap, inode, mask); > + else > + retval = do_inode_permission(idmap, inode, mask); > if (retval) > return retval; > > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap, > > return security_inode_permission(inode, mask); > } > + > +/** > + * inode_permission - Check for access rights to a given inode > + * @idmap: idmap of the mount the inode was found from > + * @inode: Inode to check permission on > + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) > + * > + * Check for read/write/execute permissions on an inode. We use fs[ug]id for > + * this, letting us set arbitrary permissions for filesystem access without > + * changing the "normal" UIDs which are used for other things. > + * > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > + */ > +int inode_permission(struct mnt_idmap *idmap, > + struct inode *inode, int mask) > +{ > + return inode_permission_mp(idmap, inode, mask, false); > +} > EXPORT_SYMBOL(inode_permission); > > /** > @@ -589,6 +614,7 @@ struct nameidata { > #define ND_ROOT_PRESET 1 > #define ND_ROOT_GRABBED 2 > #define ND_JUMPED 4 > +#define ND_SYS_ADMIN 8 > > static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name) > { > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry) > > static inline int d_revalidate(struct dentry *dentry, unsigned int flags) > { > - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) > + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) && > + likely(!(flags & LOOKUP_MOUNTPOINT))) > return dentry->d_op->d_revalidate(dentry, flags); > else > return 1; > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name, > static inline int may_lookup(struct mnt_idmap *idmap, > struct nameidata *nd) > { > + /* If we are looking for a mountpoint and we have the SYS_ADMIN > + * capability, then we will by-pass the filesys for permission checks > + * and just use generic_permission(). > + */ > + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN); > if (nd->flags & LOOKUP_RCU) { > - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK); > + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp); > if (err != -ECHILD || !try_to_unlazy(nd)) > return err; > } > - return inode_permission(idmap, nd->inode, MAY_EXEC); > + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp); > } > > static int reserve_stack(struct nameidata *nd, struct path *link) > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags, > if (IS_ERR(name)) > return PTR_ERR(name); > set_nameidata(&nd, dfd, name, root); > + if ((flags & LOOKUP_MOUNTPOINT) && may_mount()) > + nd.state = ND_SYS_ADMIN; > retval = path_lookupat(&nd, flags | LOOKUP_RCU, path); > if (unlikely(retval == -ECHILD)) > retval = path_lookupat(&nd, flags, path); > -- > 2.40.0 > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-17 11:55 ` Christian Brauner @ 2023-04-17 12:25 ` Jeff Layton 2023-04-17 14:24 ` Christian Brauner 2023-04-17 21:26 ` NeilBrown 1 sibling, 1 reply; 46+ messages in thread From: Jeff Layton @ 2023-04-17 12:25 UTC (permalink / raw) To: Christian Brauner, NeilBrown Cc: Al Viro, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote: > On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote: > > > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to > > engage with underlying systems at all. Any mount point MUST be in the > > dcache with a complete direct path from the root to the mountpoint. > > That should be sufficient to find the mountpoint given a path name. > > > > This becomes an issue when the filesystem changes unexpected, such as > > when a NFS server is changed to reject all access. It then becomes > > impossible to unmount anything mounted on the filesystem which has > > changed. We could simply lazy-unmount the changed filesystem and that > > will often be sufficient. However if the target filesystem needs > > "umount -f" to complete the unmount properly, then the lazy unmount will > > leave it incompletely unmounted. When "-f" is needed, we really need to > > I don't understand this yet. All I see is nfs_umount_begin() that's > different for MNT_FORCE to kill remaining io. Why does that preclude > MNT_DETACH? You might very well fail MNT_FORCE and the only way you can > get rid is to use MNT_DETACH, no? So I don't see why that is an > argument. > > > be able to name the target filesystem. > > > > We NEVER want to revalidate anything. We already avoid the revalidation > > of the mountpoint itself, be we won't need to revalidate anything on the > > path either as thay might affect the cache, and the cache is what we are > > really looking in. > > > > Permission checks are a little less clear. We currently allow any user > > This is all very brittle. > > First case that comes to mind is overlayfs where the permission checking > is performed twice. Once on the overlayfs inode itself based on the > caller's security context and a second time on the underlying inode with > the security context of the mounter of the overlayfs instance. > > A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so > they'd be able to mount the overlayfs instance but would be restricted > from accessing certain directories or files. The task accessing the > overlayfs instance however could have a completely different security > context including CAP_DAC_READ_SEARCH and such. Both tasks could > reasonably be in different user namespaces and so on. > > The LSM hooks are also called twice and would now also be called once. > > It also forgets that acl_permission() check may very well call into the > filesystem via check_acl(). > > So umount could either be used to infer existence of files that the user > wouldn't otherwise know they exist or in the worst case allow to umount > something that they wouldn't have access to. > > Aside from that this would break userspace assumptions and as Al and > I've mentioned before in the other thread you'd need a new flag to > umount2() for this. The permission model can't just change behind users > back. > > But I dislike it for the now even more special-cased umount path lookup > alone tbh. I'd feel way more comfortable with a non-lookup related > solution that doesn't have subtle implications for permission checking. > These are good points. One way around the issues you point out might be to pass down a new MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the filesystem driver to decide whether it wants to avoid potentially problematic activity when checking permissions. nfs_permission could check for that and take a more hands-off approach to the permissions check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I think that might do what we need. > > to make the umount syscall and perform the path lookup and only reject > > unprivileged users once the target mount point has been found. If we > > completely relax permission checks then an unprivileged user could probe > > inaccessible parts of the name space by examining the error returned > > from umount(). > > > > So we only relax permission check when the user has CAP_SYS_ADMIN > > (may_mount() succeeds). > > > > Note that if the path given is not direct and for example uses symlinks > > or "..", then dentries or symlink content may not be cached and a remote > > server could cause problem. We can only be certain of a safe unmount if > > a direct path is used. > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > --- > > fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------ > > 1 file changed, 40 insertions(+), 6 deletions(-) > > > > diff --git a/fs/namei.c b/fs/namei.c > > index edfedfbccaef..f2df1adae7c5 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) > > * > > * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > > */ > > -int inode_permission(struct mnt_idmap *idmap, > > - struct inode *inode, int mask) > > +int inode_permission_mp(struct mnt_idmap *idmap, > > + struct inode *inode, int mask, bool mp) > > { > > int retval; > > > > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap, > > return -EACCES; > > } > > > > - retval = do_inode_permission(idmap, inode, mask); > > + if (mp) > > + /* We are looking for a mountpoint and so don't > > + * want to interact with the filesystem at all, just > > + * the dcache and icache. > > + */ > > + retval = generic_permission(idmap, inode, mask); > > + else > > + retval = do_inode_permission(idmap, inode, mask); > > if (retval) > > return retval; > > > > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap, > > > > return security_inode_permission(inode, mask); > > } > > + > > +/** > > + * inode_permission - Check for access rights to a given inode > > + * @idmap: idmap of the mount the inode was found from > > + * @inode: Inode to check permission on > > + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) > > + * > > + * Check for read/write/execute permissions on an inode. We use fs[ug]id for > > + * this, letting us set arbitrary permissions for filesystem access without > > + * changing the "normal" UIDs which are used for other things. > > + * > > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > > + */ > > +int inode_permission(struct mnt_idmap *idmap, > > + struct inode *inode, int mask) > > +{ > > + return inode_permission_mp(idmap, inode, mask, false); > > +} > > EXPORT_SYMBOL(inode_permission); > > > > /** > > @@ -589,6 +614,7 @@ struct nameidata { > > #define ND_ROOT_PRESET 1 > > #define ND_ROOT_GRABBED 2 > > #define ND_JUMPED 4 > > +#define ND_SYS_ADMIN 8 > > > > static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name) > > { > > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry) > > > > static inline int d_revalidate(struct dentry *dentry, unsigned int flags) > > { > > - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) > > + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) && > > + likely(!(flags & LOOKUP_MOUNTPOINT))) > > return dentry->d_op->d_revalidate(dentry, flags); > > else > > return 1; > > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name, > > static inline int may_lookup(struct mnt_idmap *idmap, > > struct nameidata *nd) > > { > > + /* If we are looking for a mountpoint and we have the SYS_ADMIN > > + * capability, then we will by-pass the filesys for permission checks > > + * and just use generic_permission(). > > + */ > > + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN); > > if (nd->flags & LOOKUP_RCU) { > > - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK); > > + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp); > > if (err != -ECHILD || !try_to_unlazy(nd)) > > return err; > > } > > - return inode_permission(idmap, nd->inode, MAY_EXEC); > > + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp); > > } > > > > static int reserve_stack(struct nameidata *nd, struct path *link) > > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags, > > if (IS_ERR(name)) > > return PTR_ERR(name); > > set_nameidata(&nd, dfd, name, root); > > + if ((flags & LOOKUP_MOUNTPOINT) && may_mount()) > > + nd.state = ND_SYS_ADMIN; > > retval = path_lookupat(&nd, flags | LOOKUP_RCU, path); > > if (unlikely(retval == -ECHILD)) > > retval = path_lookupat(&nd, flags, path); > > -- > > 2.40.0 > > -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-17 12:25 ` Jeff Layton @ 2023-04-17 14:24 ` Christian Brauner 2023-04-17 15:21 ` Jeff Layton 0 siblings, 1 reply; 46+ messages in thread From: Christian Brauner @ 2023-04-17 14:24 UTC (permalink / raw) To: Jeff Layton Cc: NeilBrown, Al Viro, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Mon, Apr 17, 2023 at 08:25:23AM -0400, Jeff Layton wrote: > On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote: > > On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote: > > > > > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to > > > engage with underlying systems at all. Any mount point MUST be in the > > > dcache with a complete direct path from the root to the mountpoint. > > > That should be sufficient to find the mountpoint given a path name. > > > > > > This becomes an issue when the filesystem changes unexpected, such as > > > when a NFS server is changed to reject all access. It then becomes > > > impossible to unmount anything mounted on the filesystem which has > > > changed. We could simply lazy-unmount the changed filesystem and that > > > will often be sufficient. However if the target filesystem needs > > > "umount -f" to complete the unmount properly, then the lazy unmount will > > > leave it incompletely unmounted. When "-f" is needed, we really need to > > > > I don't understand this yet. All I see is nfs_umount_begin() that's > > different for MNT_FORCE to kill remaining io. Why does that preclude > > MNT_DETACH? You might very well fail MNT_FORCE and the only way you can > > get rid is to use MNT_DETACH, no? So I don't see why that is an > > argument. > > > > > be able to name the target filesystem. > > > > > > We NEVER want to revalidate anything. We already avoid the revalidation > > > of the mountpoint itself, be we won't need to revalidate anything on the > > > path either as thay might affect the cache, and the cache is what we are > > > really looking in. > > > > > > Permission checks are a little less clear. We currently allow any user > > > > This is all very brittle. > > > > First case that comes to mind is overlayfs where the permission checking > > is performed twice. Once on the overlayfs inode itself based on the > > caller's security context and a second time on the underlying inode with > > the security context of the mounter of the overlayfs instance. > > > > A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so > > they'd be able to mount the overlayfs instance but would be restricted > > from accessing certain directories or files. The task accessing the > > overlayfs instance however could have a completely different security > > context including CAP_DAC_READ_SEARCH and such. Both tasks could > > reasonably be in different user namespaces and so on. > > > > The LSM hooks are also called twice and would now also be called once. > > > > It also forgets that acl_permission() check may very well call into the > > filesystem via check_acl(). > > > > So umount could either be used to infer existence of files that the user > > wouldn't otherwise know they exist or in the worst case allow to umount > > something that they wouldn't have access to. > > > > Aside from that this would break userspace assumptions and as Al and > > I've mentioned before in the other thread you'd need a new flag to > > umount2() for this. The permission model can't just change behind users > > back. > > > > But I dislike it for the now even more special-cased umount path lookup > > alone tbh. I'd feel way more comfortable with a non-lookup related > > solution that doesn't have subtle implications for permission checking. > > > > These are good points. > > One way around the issues you point out might be to pass down a new > MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the > filesystem driver to decide whether it wants to avoid potentially > problematic activity when checking permissions. nfs_permission could > check for that and take a more hands-off approach to the permissions > check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I > think that might do what we need. Yes, that's pretty obvious. I considered that, wrote the section and deleted it again because I still find this pretty ugly. It does leak very specific lookup information into filesystems that isn't generically useful like MAY_NOT_BLOCK is. Most filesystems don't need to check with a server like NFS does and so don't suffer from this issue. The crucial change in the patchset in its current form is that you're requesting from the VFS to significantly alter permission checking just because this is a umount request in a pretty fundamental way for roughly 21 filesytems. Afaict, on the VFS level that doesn't make sense. The VFS can't just skip a filesystem's ->permission() handler with "well, it's just on a way to a umount so whatever". That's just not going to be correct and is just a source of subtle security bugs. So NAK on that. And I'm curious why is it obvious that we don't want to revalidate _any_ path component and not just the last one? Why is that generally safe? Why can't this be used to access files and directories the caller wouldn't otherwise be able to access? I would like to have this spelled out for slow people like me, please. From my point of view, this would only be somewhat safe _generally_ if you'd allow circumvention for revalidation and permission checking if MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH). You'd still mess with overlayfs permission model in this case though. Plus, there are better options of solving this problem. Again, I'd rather build a separate api for unmounting then playing such potentially subtle security sensitive games with permission checking during path lookup. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-17 14:24 ` Christian Brauner @ 2023-04-17 15:21 ` Jeff Layton 2023-04-17 21:34 ` NeilBrown 2023-04-18 3:25 ` Andreas Dilger 0 siblings, 2 replies; 46+ messages in thread From: Jeff Layton @ 2023-04-17 15:21 UTC (permalink / raw) To: Christian Brauner Cc: NeilBrown, Al Viro, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote: > On Mon, Apr 17, 2023 at 08:25:23AM -0400, Jeff Layton wrote: > > On Mon, 2023-04-17 at 13:55 +0200, Christian Brauner wrote: > > > On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote: > > > > > > > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to > > > > engage with underlying systems at all. Any mount point MUST be in the > > > > dcache with a complete direct path from the root to the mountpoint. > > > > That should be sufficient to find the mountpoint given a path name. > > > > > > > > This becomes an issue when the filesystem changes unexpected, such as > > > > when a NFS server is changed to reject all access. It then becomes > > > > impossible to unmount anything mounted on the filesystem which has > > > > changed. We could simply lazy-unmount the changed filesystem and that > > > > will often be sufficient. However if the target filesystem needs > > > > "umount -f" to complete the unmount properly, then the lazy unmount will > > > > leave it incompletely unmounted. When "-f" is needed, we really need to > > > > > > I don't understand this yet. All I see is nfs_umount_begin() that's > > > different for MNT_FORCE to kill remaining io. Why does that preclude > > > MNT_DETACH? You might very well fail MNT_FORCE and the only way you can > > > get rid is to use MNT_DETACH, no? So I don't see why that is an > > > argument. > > > > > > > be able to name the target filesystem. > > > > > > > > We NEVER want to revalidate anything. We already avoid the revalidation > > > > of the mountpoint itself, be we won't need to revalidate anything on the > > > > path either as thay might affect the cache, and the cache is what we are > > > > really looking in. > > > > > > > > Permission checks are a little less clear. We currently allow any user > > > > > > This is all very brittle. > > > > > > First case that comes to mind is overlayfs where the permission checking > > > is performed twice. Once on the overlayfs inode itself based on the > > > caller's security context and a second time on the underlying inode with > > > the security context of the mounter of the overlayfs instance. > > > > > > A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so > > > they'd be able to mount the overlayfs instance but would be restricted > > > from accessing certain directories or files. The task accessing the > > > overlayfs instance however could have a completely different security > > > context including CAP_DAC_READ_SEARCH and such. Both tasks could > > > reasonably be in different user namespaces and so on. > > > > > > The LSM hooks are also called twice and would now also be called once. > > > > > > It also forgets that acl_permission() check may very well call into the > > > filesystem via check_acl(). > > > > > > So umount could either be used to infer existence of files that the user > > > wouldn't otherwise know they exist or in the worst case allow to umount > > > something that they wouldn't have access to. > > > > > > Aside from that this would break userspace assumptions and as Al and > > > I've mentioned before in the other thread you'd need a new flag to > > > umount2() for this. The permission model can't just change behind users > > > back. > > > > > > But I dislike it for the now even more special-cased umount path lookup > > > alone tbh. I'd feel way more comfortable with a non-lookup related > > > solution that doesn't have subtle implications for permission checking. > > > > > > > These are good points. > > > > One way around the issues you point out might be to pass down a new > > MAY_LOOKUP_MOUNTPOINT mask flag to ->permission. That would allow the > > filesystem driver to decide whether it wants to avoid potentially > > problematic activity when checking permissions. nfs_permission could > > check for that and take a more hands-off approach to the permissions > > check. Between that and skipping d_revalidate on LOOKUP_MOUNTPOINT, I > > think that might do what we need. > > Yes, that's pretty obvious. I considered that, wrote the section and > deleted it again because I still find this pretty ugly. It does leak > very specific lookup information into filesystems that isn't generically > useful like MAY_NOT_BLOCK is. Most filesystems don't need to check with > a server like NFS does and so don't suffer from this issue. > Sort of. Most of the MAY flags cover a specific operation. In this case, we'd just be adding a flag to make it clear that this permission check is for the specific purpose of chasing down a mountpoint (presumably to umount). This field does look less like a "mask" field now though, and more like a "flags". > The crucial change in the patchset in its current form is that you're > requesting from the VFS to significantly alter permission checking just > because this is a umount request in a pretty fundamental way for roughly > 21 filesytems. Afaict, on the VFS level that doesn't make sense. The VFS > can't just skip a filesystem's ->permission() handler with "well, it's > just on a way to a umount so whatever". That's just not going to be > correct and is just a source of subtle security bugs. So NAK on that. > Fair enough. I too think the permission check ought to be left up to the filesystem driver. It does need some way to know that the permission check is in the context of a LOOKUP_MOUNTPOINT pathwalk though. A MAY_* flag seems like the obvious choice, since we could use that for checking ACLs too, but maybe there is some better way. > And I'm curious why is it obvious that we don't want to revalidate _any_ > path component and not just the last one? Why is that generally safe? > Why can't this be used to access files and directories the caller > wouldn't otherwise be able to access? I would like to have this spelled > out for slow people like me, please. > > From my point of view, this would only be somewhat safe _generally_ if > you'd allow circumvention for revalidation and permission checking if > MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH). > You'd still mess with overlayfs permission model in this case though. > > Plus, there are better options of solving this problem. Again, I'd > rather build a separate api for unmounting then playing such potentially > subtle security sensitive games with permission checking during path > lookup. umount(2) is really a special case because the whole intent is to detach a mount from the local hierarchy and stop using it. The unfortunate bit is that it is a path-based syscall. So usually we have to: - determine the path: Maybe stat() it and to validate that it's the mountpoint we want to drop - then call umount with that path The last thing we want in that case is for the server to decide to change some intermediate dentry in between the two operations. Best case, you'll get back ENOENT or something when the pathwalk fails. Worst case, the server swaps what are two different mountpoints on your client and you unmount the wrong one. If we don't revaliate, then we're no worse off, and may be better off if something hinky happens to the server of an intermediate dentry in the path. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-17 15:21 ` Jeff Layton @ 2023-04-17 21:34 ` NeilBrown 2023-04-18 8:10 ` Christian Brauner 2023-04-18 3:25 ` Andreas Dilger 1 sibling, 1 reply; 46+ messages in thread From: NeilBrown @ 2023-04-17 21:34 UTC (permalink / raw) To: Jeff Layton Cc: Christian Brauner, Al Viro, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Tue, 18 Apr 2023, Jeff Layton wrote: > > The last thing we want in that case is for the server to decide to > change some intermediate dentry in between the two operations. Best > case, you'll get back ENOENT or something when the pathwalk fails. Worst > case, the server swaps what are two different mountpoints on your client > and you unmount the wrong one. Actually, the last think I want is for the server to do something that causes a directory to be invalidated (d_invalidate()). Then any mount points below there get DETACHed and we lose any change to use MNT_FORCE or to wait for the unmount to complete. Of course this can also happen during any other access, not just umount.... > > If we don't revaliate, then we're no worse off, and may be better off if > something hinky happens to the server of an intermediate dentry in the > path. Exactly. If we don't revalidate we might use old data, but it will be old data that were were allowed to access. It might be data that we are not now allowed to access, but it will never be new data that were have never been allowed to access. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-17 21:34 ` NeilBrown @ 2023-04-18 8:10 ` Christian Brauner 0 siblings, 0 replies; 46+ messages in thread From: Christian Brauner @ 2023-04-18 8:10 UTC (permalink / raw) To: NeilBrown Cc: Jeff Layton, Al Viro, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Tue, Apr 18, 2023 at 07:34:14AM +1000, NeilBrown wrote: > On Tue, 18 Apr 2023, Jeff Layton wrote: > > > > The last thing we want in that case is for the server to decide to > > change some intermediate dentry in between the two operations. Best > > case, you'll get back ENOENT or something when the pathwalk fails. Worst > > case, the server swaps what are two different mountpoints on your client > > and you unmount the wrong one. > > Actually, the last think I want is for the server to do something that > causes a directory to be invalidated (d_invalidate()). Then any mount > points below there get DETACHed and we lose any change to use MNT_FORCE > or to wait for the unmount to complete. Of course this can also happen > during any other access, not just umount.... Any rmdir/unlink from another mount namespace where the mountpoint isn't a local mountpoint would lazily unmount the whole mount tree. You can't guard against this anyway. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-17 15:21 ` Jeff Layton 2023-04-17 21:34 ` NeilBrown @ 2023-04-18 3:25 ` Andreas Dilger 2023-04-18 8:04 ` Christian Brauner 1 sibling, 1 reply; 46+ messages in thread From: Andreas Dilger @ 2023-04-18 3:25 UTC (permalink / raw) To: Jeff Layton Cc: Christian Brauner, NeilBrown, Al Viro, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig, Karel Zak [-- Attachment #1: Type: text/plain, Size: 2503 bytes --] > On Apr 17, 2023, at 9:21 AM, Jeff Layton <jlayton@kernel.org> wrote: > > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote: >> And I'm curious why is it obvious that we don't want to revalidate _any_ >> path component and not just the last one? Why is that generally safe? >> Why can't this be used to access files and directories the caller >> wouldn't otherwise be able to access? I would like to have this spelled >> out for slow people like me, please. >> >> From my point of view, this would only be somewhat safe _generally_ if >> you'd allow circumvention for revalidation and permission checking if >> MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH). >> You'd still mess with overlayfs permission model in this case though. >> >> Plus, there are better options of solving this problem. Again, I'd >> rather build a separate api for unmounting then playing such potentially >> subtle security sensitive games with permission checking during path >> lookup. > > umount(2) is really a special case because the whole intent is to detach > a mount from the local hierarchy and stop using it. The unfortunate bit > is that it is a path-based syscall. > > So usually we have to: > > - determine the path: Maybe stat() it and to validate that it's the > mountpoint we want to drop The stat() itself may hang because a remote server, or USB stick is inaccessible or having media errors. I've just been having a conversation with Karel Zak to change umount(1) to use statx() so that it interacts minimally with the fs. In particular, nfs_getattr() skips revalidate if only minimal attrs are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if locally-cached attrs are still valid (STATX_MODE), so this will avoid yet one more place that unmount can hang. In theory, vfs_getattr() could get all of these attributes directly from the vfs_inode in the unmount case. > - then call umount with that path > > The last thing we want in that case is for the server to decide to > change some intermediate dentry in between the two operations. Best > case, you'll get back ENOENT or something when the pathwalk fails. Worst > case, the server swaps what are two different mountpoints on your client > and you unmount the wrong one. > > If we don't revaliate, then we're no worse off, and may be better off if > something hinky happens to the server of an intermediate dentry in the > path. > -- > Jeff Layton <jlayton@kernel.org> Cheers, Andreas [-- Attachment #2: Message signed with OpenPGP --] [-- Type: application/pgp-signature, Size: 873 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-18 3:25 ` Andreas Dilger @ 2023-04-18 8:04 ` Christian Brauner 2023-04-20 13:05 ` Jeff Layton 0 siblings, 1 reply; 46+ messages in thread From: Christian Brauner @ 2023-04-18 8:04 UTC (permalink / raw) To: Andreas Dilger Cc: Jeff Layton, NeilBrown, Al Viro, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig, Karel Zak On Mon, Apr 17, 2023 at 09:25:20PM -0600, Andreas Dilger wrote: > > > On Apr 17, 2023, at 9:21 AM, Jeff Layton <jlayton@kernel.org> wrote: > > > > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote: > >> And I'm curious why is it obvious that we don't want to revalidate _any_ > >> path component and not just the last one? Why is that generally safe? > >> Why can't this be used to access files and directories the caller > >> wouldn't otherwise be able to access? I would like to have this spelled > >> out for slow people like me, please. > >> > >> From my point of view, this would only be somewhat safe _generally_ if > >> you'd allow circumvention for revalidation and permission checking if > >> MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH). > >> You'd still mess with overlayfs permission model in this case though. > >> > >> Plus, there are better options of solving this problem. Again, I'd > >> rather build a separate api for unmounting then playing such potentially > >> subtle security sensitive games with permission checking during path > >> lookup. > > > > umount(2) is really a special case because the whole intent is to detach > > a mount from the local hierarchy and stop using it. The unfortunate bit > > is that it is a path-based syscall. > > > > So usually we have to: > > > > - determine the path: Maybe stat() it and to validate that it's the > > mountpoint we want to drop > > The stat() itself may hang because a remote server, or USB stick is > inaccessible or having media errors. > > I've just been having a conversation with Karel Zak to change > umount(1) to use statx() so that it interacts minimally with the fs. So we're talking about this commit: https://github.com/util-linux/util-linux/commit/42e141d20505a0deb969c2e583a463c26aadc62f and the patch we discussed here: https://github.com/util-linux/util-linux/issues/2049 > > In particular, nfs_getattr() skips revalidate if only minimal attrs > are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if > locally-cached attrs are still valid (STATX_MODE), so this will > avoid yet one more place that unmount can hang. > > In theory, vfs_getattr() could get all of these attributes directly > from the vfs_inode in the unmount case. We don't really need that. As pointed out in that discussion and as Karel did we just want to pass AT_STATX_DONT_SYNC. An api that would allow unmounting by mount id can safely check and retrieve AT_STATX_DONT_SYNC with STATX_ATTR_MOUNT_ROOT and STATX_MNT_ID without ever syncing with the server. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-18 8:04 ` Christian Brauner @ 2023-04-20 13:05 ` Jeff Layton 2023-04-20 15:41 ` Christian Brauner 0 siblings, 1 reply; 46+ messages in thread From: Jeff Layton @ 2023-04-20 13:05 UTC (permalink / raw) To: Christian Brauner, Andreas Dilger Cc: NeilBrown, Al Viro, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig, Karel Zak On Tue, 2023-04-18 at 10:04 +0200, Christian Brauner wrote: > On Mon, Apr 17, 2023 at 09:25:20PM -0600, Andreas Dilger wrote: > > > > > On Apr 17, 2023, at 9:21 AM, Jeff Layton <jlayton@kernel.org> wrote: > > > > > > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote: > > > > And I'm curious why is it obvious that we don't want to revalidate _any_ > > > > path component and not just the last one? Why is that generally safe? > > > > Why can't this be used to access files and directories the caller > > > > wouldn't otherwise be able to access? I would like to have this spelled > > > > out for slow people like me, please. > > > > > > > > From my point of view, this would only be somewhat safe _generally_ if > > > > you'd allow circumvention for revalidation and permission checking if > > > > MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH). > > > > You'd still mess with overlayfs permission model in this case though. > > > > > > > > Plus, there are better options of solving this problem. Again, I'd > > > > rather build a separate api for unmounting then playing such potentially > > > > subtle security sensitive games with permission checking during path > > > > lookup. > > > > > > umount(2) is really a special case because the whole intent is to detach > > > a mount from the local hierarchy and stop using it. The unfortunate bit > > > is that it is a path-based syscall. > > > > > > So usually we have to: > > > > > > - determine the path: Maybe stat() it and to validate that it's the > > > mountpoint we want to drop > > > > The stat() itself may hang because a remote server, or USB stick is > > inaccessible or having media errors. > > > > I've just been having a conversation with Karel Zak to change > > umount(1) to use statx() so that it interacts minimally with the fs. > > So we're talking about this commit: > https://github.com/util-linux/util-linux/commit/42e141d20505a0deb969c2e583a463c26aadc62f > and the patch we discussed here: > https://github.com/util-linux/util-linux/issues/2049 > > > > > In particular, nfs_getattr() skips revalidate if only minimal attrs > > are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if > > locally-cached attrs are still valid (STATX_MODE), so this will > > avoid yet one more place that unmount can hang. > > > > In theory, vfs_getattr() could get all of these attributes directly > > from the vfs_inode in the unmount case. > > We don't really need that. As pointed out in that discussion and as > Karel did we just want to pass AT_STATX_DONT_SYNC. > > An api that would allow unmounting by mount id can safely check and > retrieve AT_STATX_DONT_SYNC with STATX_ATTR_MOUNT_ROOT and STATX_MNT_ID > without ever syncing with the server. Unfortunately, I don't think that flag trickles down to the ->permission checks for the pathwalk down to the point you're stat'ing. That turns out to be a big part of the problem when trying to umount things that are stuck down in inaccessible trees. I'm not opposed at all to new APIs that allow for more reliable unmounting. I think that's a good idea, and something worth hashing out. But...we're stuck with umount() in perpetuity. Even if you were to merge a new API for unmounting today, it'll be a decade or more until the ecosystem catches up. I think we have a responsibility to make the umount syscall work as well as we can. In this case, that means giving it as light a footprint as possible on the pathwalk down. If we need to gate this behavior behind existing or new flags to umount2(), then so be it, but lets not dismiss this idea in favor of some new unmounting interface that may never materialize. Improving umount has the potential to help a lot of our users. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-20 13:05 ` Jeff Layton @ 2023-04-20 15:41 ` Christian Brauner 0 siblings, 0 replies; 46+ messages in thread From: Christian Brauner @ 2023-04-20 15:41 UTC (permalink / raw) To: Jeff Layton Cc: Andreas Dilger, NeilBrown, Al Viro, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig, Karel Zak On Thu, Apr 20, 2023 at 09:05:47AM -0400, Jeff Layton wrote: > On Tue, 2023-04-18 at 10:04 +0200, Christian Brauner wrote: > > On Mon, Apr 17, 2023 at 09:25:20PM -0600, Andreas Dilger wrote: > > > > > > > On Apr 17, 2023, at 9:21 AM, Jeff Layton <jlayton@kernel.org> wrote: > > > > > > > > On Mon, 2023-04-17 at 16:24 +0200, Christian Brauner wrote: > > > > > And I'm curious why is it obvious that we don't want to revalidate _any_ > > > > > path component and not just the last one? Why is that generally safe? > > > > > Why can't this be used to access files and directories the caller > > > > > wouldn't otherwise be able to access? I would like to have this spelled > > > > > out for slow people like me, please. > > > > > > > > > > From my point of view, this would only be somewhat safe _generally_ if > > > > > you'd allow circumvention for revalidation and permission checking if > > > > > MNT_FORCE is specified and the caller has capable(CAP_DAC_READ_SEARCH). > > > > > You'd still mess with overlayfs permission model in this case though. > > > > > > > > > > Plus, there are better options of solving this problem. Again, I'd > > > > > rather build a separate api for unmounting then playing such potentially > > > > > subtle security sensitive games with permission checking during path > > > > > lookup. > > > > > > > > umount(2) is really a special case because the whole intent is to detach > > > > a mount from the local hierarchy and stop using it. The unfortunate bit > > > > is that it is a path-based syscall. > > > > > > > > So usually we have to: > > > > > > > > - determine the path: Maybe stat() it and to validate that it's the > > > > mountpoint we want to drop > > > > > > The stat() itself may hang because a remote server, or USB stick is > > > inaccessible or having media errors. > > > > > > I've just been having a conversation with Karel Zak to change > > > umount(1) to use statx() so that it interacts minimally with the fs. > > > > So we're talking about this commit: > > https://github.com/util-linux/util-linux/commit/42e141d20505a0deb969c2e583a463c26aadc62f > > and the patch we discussed here: > > https://github.com/util-linux/util-linux/issues/2049 > > > > > > > > In particular, nfs_getattr() skips revalidate if only minimal attrs > > > are fetched (STATX_TYPE | STATX_INO), and also skips revalidate if > > > locally-cached attrs are still valid (STATX_MODE), so this will > > > avoid yet one more place that unmount can hang. > > > > > > In theory, vfs_getattr() could get all of these attributes directly > > > from the vfs_inode in the unmount case. > > > > We don't really need that. As pointed out in that discussion and as > > Karel did we just want to pass AT_STATX_DONT_SYNC. > > > > An api that would allow unmounting by mount id can safely check and > > retrieve AT_STATX_DONT_SYNC with STATX_ATTR_MOUNT_ROOT and STATX_MNT_ID > > without ever syncing with the server. > > Unfortunately, I don't think that flag trickles down to the ->permission > checks for the pathwalk down to the point you're stat'ing. That turns > out to be a big part of the problem when trying to umount things that > are stuck down in inaccessible trees. > > I'm not opposed at all to new APIs that allow for more reliable > unmounting. I think that's a good idea, and something worth hashing out. > > But...we're stuck with umount() in perpetuity. Even if you were to merge > a new API for unmounting today, it'll be a decade or more until the > ecosystem catches up. I think we have a responsibility to make the > umount syscall work as well as we can. In this case, that means giving > it as light a footprint as possible on the pathwalk down. > > If we need to gate this behavior behind existing or new flags to > umount2(), then so be it, but lets not dismiss this idea in favor of > some new unmounting interface that may never materialize. Improving > umount has the potential to help a lot of our users. A new flag for an old system call or a new system call doesn't matter in practice for userspace. And the users that have that specific problem that's solved by a new interface will use that interface asap. That's been true for the pidfd api, that's been true for openat2(), clone3() and others. Plus, this is a workload specific problem. And even with or without a new flag it'd still need to be backported to old kernels. And just because an interface already exists doesn't mean stuff should be shoehorned into it just because it's convenient or faster. That's just asking for maintenance pain down the road. And from this discussion there's multiple ways to work around this issue currently so I especially don't see the need to rush any of this and fiddle around with permission checking. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-17 11:55 ` Christian Brauner 2023-04-17 12:25 ` Jeff Layton @ 2023-04-17 21:26 ` NeilBrown 2023-04-20 21:35 ` Al Viro 1 sibling, 1 reply; 46+ messages in thread From: NeilBrown @ 2023-04-17 21:26 UTC (permalink / raw) To: Christian Brauner Cc: Jeff Layton, Al Viro, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Mon, 17 Apr 2023, Christian Brauner wrote: > On Mon, Apr 17, 2023 at 09:13:52AM +1000, NeilBrown wrote: > > > > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to > > engage with underlying systems at all. Any mount point MUST be in the > > dcache with a complete direct path from the root to the mountpoint. > > That should be sufficient to find the mountpoint given a path name. > > > > This becomes an issue when the filesystem changes unexpected, such as > > when a NFS server is changed to reject all access. It then becomes > > impossible to unmount anything mounted on the filesystem which has > > changed. We could simply lazy-unmount the changed filesystem and that > > will often be sufficient. However if the target filesystem needs > > "umount -f" to complete the unmount properly, then the lazy unmount will > > leave it incompletely unmounted. When "-f" is needed, we really need to > > I don't understand this yet. All I see is nfs_umount_begin() that's > different for MNT_FORCE to kill remaining io. Why does that preclude > MNT_DETACH? You might very well fail MNT_FORCE and the only way you can > get rid is to use MNT_DETACH, no? So I don't see why that is an > argument. Possibly I could have been more clear. If /foo is the mount point for a filesystem that is causing problems and /foo/bar is the mount point of a different filesystem, which might have other problems, then I was saying that the simple solution of umount -l /foo which would MNT_DETACH both filesystems is not ideal because there might be pending IO that will never complete, so the mount can never be cleaned up. In such cases we really want to be able to do umount -f /foo/bar and ensure that succeeds. You can see now that it isn't that MNT_FORCE precludes MNT_DETACH, but they would be done on different filesystems. MNT_FORCE is, I think, a good idea and a needed functionality that has never been implemented well. MNT_FORCE causes nfs_umount_begin to be called as you noted, which aborts all pending RPCs on that filesystem. This might be useful if you have an "ls" running, as it is likely to abort on the first getdents() failure. But if you have "ls -l' running, it will likely just go and try another lstat(), and that is just as likely to hang - thus still preventing the umount. What we used to do was find anything using the mount and kill it (sometime "fuser -k" would work). These processes would likely be stuck in a D wait, but "umount -f" would abort the RPCs and the processes would get unstuck and could respond to the signal. This would sometimes take a couple of iterations. These days we have TASK_KILLABLE so this is much less likely to be needed - we just kill all the processes and they die promptly. Most of the time. There are a few places that can still block, but which no-one has had the motivation to fix. It would be much nicer if we didn't need to kill things. Even with "fuser -k" it isn't really the sort of thing you want in a shutdown script (or in systemd-shutdown). It would be ideal if the shutdown process could call "umount" and if that fails with EBUSY, call "umount -f" and be confident of .... something. Currently "-f" might inspire hope, but not confidence. We cannot realistically make MNT_FORCE truly force a umount because processes really need to die before that happens. But we could ensure that the filesystem is quiescent and stays that way. Maybe we could add a flag to 'struct vfsmount' to say that "unmount has been forced". Any attempt to use a fd with such a vfsmnt would fail, expect close() and anything similar. Maybe nfs_umount_begin could iterate all open files on that vfsmnt and purge any cached data so no background writes were needed. I think this would be very close to what we needed. Then systemd could run umount() and if that failed then umount(MNT_FORCE) and if that fails umount(MNT_DETACH), with confidence that there would be not more RPCs generates, so it would be safe to turn off the network. BTW, that is another reason that just doing "umount -l /foo" is not ideal. We sometimes want to wait for the umount to complete, so we can remove the backing store or the network. "umount -l /foo" doesn't let us wait. "umount /foo/bar" does. > > > be able to name the target filesystem. > > > > We NEVER want to revalidate anything. We already avoid the revalidation > > of the mountpoint itself, be we won't need to revalidate anything on the > > path either as thay might affect the cache, and the cache is what we are > > really looking in. > > > > Permission checks are a little less clear. We currently allow any user > > This is all very brittle. > > First case that comes to mind is overlayfs where the permission checking > is performed twice. Once on the overlayfs inode itself based on the > caller's security context and a second time on the underlying inode with > the security context of the mounter of the overlayfs instance. > > A mounter could have dropped all privileges aside from CAP_SYS_ADMIN so > they'd be able to mount the overlayfs instance but would be restricted > from accessing certain directories or files. The task accessing the > overlayfs instance however could have a completely different security > context including CAP_DAC_READ_SEARCH and such. Both tasks could > reasonably be in different user namespaces and so on. > > The LSM hooks are also called twice and would now also be called once. I'd prefer "called never" > > It also forgets that acl_permission() check may very well call into the > filesystem via check_acl(). I didn't pay proper attention to acl_permission() - you are right. > > So umount could either be used to infer existence of files that the user > wouldn't otherwise know they exist or in the worst case allow to umount > something that they wouldn't have access to. > > Aside from that this would break userspace assumptions and as Al and > I've mentioned before in the other thread you'd need a new flag to > umount2() for this. The permission model can't just change behind users > back. > > But I dislike it for the now even more special-cased umount path lookup > alone tbh. I'd feel way more comfortable with a non-lookup related > solution that doesn't have subtle implications for permission checking. I was really hoping that existing code could be made to just work. Thanks for the review. NeilBrown > > > to make the umount syscall and perform the path lookup and only reject > > unprivileged users once the target mount point has been found. If we > > completely relax permission checks then an unprivileged user could probe > > inaccessible parts of the name space by examining the error returned > > from umount(). > > > > So we only relax permission check when the user has CAP_SYS_ADMIN > > (may_mount() succeeds). > > > > Note that if the path given is not direct and for example uses symlinks > > or "..", then dentries or symlink content may not be cached and a remote > > server could cause problem. We can only be certain of a safe unmount if > > a direct path is used. > > > > Signed-off-by: NeilBrown <neilb@suse.de> > > --- > > fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------ > > 1 file changed, 40 insertions(+), 6 deletions(-) > > > > diff --git a/fs/namei.c b/fs/namei.c > > index edfedfbccaef..f2df1adae7c5 100644 > > --- a/fs/namei.c > > +++ b/fs/namei.c > > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) > > * > > * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > > */ > > -int inode_permission(struct mnt_idmap *idmap, > > - struct inode *inode, int mask) > > +int inode_permission_mp(struct mnt_idmap *idmap, > > + struct inode *inode, int mask, bool mp) > > { > > int retval; > > > > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap, > > return -EACCES; > > } > > > > - retval = do_inode_permission(idmap, inode, mask); > > + if (mp) > > + /* We are looking for a mountpoint and so don't > > + * want to interact with the filesystem at all, just > > + * the dcache and icache. > > + */ > > + retval = generic_permission(idmap, inode, mask); > > + else > > + retval = do_inode_permission(idmap, inode, mask); > > if (retval) > > return retval; > > > > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap, > > > > return security_inode_permission(inode, mask); > > } > > + > > +/** > > + * inode_permission - Check for access rights to a given inode > > + * @idmap: idmap of the mount the inode was found from > > + * @inode: Inode to check permission on > > + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) > > + * > > + * Check for read/write/execute permissions on an inode. We use fs[ug]id for > > + * this, letting us set arbitrary permissions for filesystem access without > > + * changing the "normal" UIDs which are used for other things. > > + * > > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > > + */ > > +int inode_permission(struct mnt_idmap *idmap, > > + struct inode *inode, int mask) > > +{ > > + return inode_permission_mp(idmap, inode, mask, false); > > +} > > EXPORT_SYMBOL(inode_permission); > > > > /** > > @@ -589,6 +614,7 @@ struct nameidata { > > #define ND_ROOT_PRESET 1 > > #define ND_ROOT_GRABBED 2 > > #define ND_JUMPED 4 > > +#define ND_SYS_ADMIN 8 > > > > static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name) > > { > > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry) > > > > static inline int d_revalidate(struct dentry *dentry, unsigned int flags) > > { > > - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) > > + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) && > > + likely(!(flags & LOOKUP_MOUNTPOINT))) > > return dentry->d_op->d_revalidate(dentry, flags); > > else > > return 1; > > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name, > > static inline int may_lookup(struct mnt_idmap *idmap, > > struct nameidata *nd) > > { > > + /* If we are looking for a mountpoint and we have the SYS_ADMIN > > + * capability, then we will by-pass the filesys for permission checks > > + * and just use generic_permission(). > > + */ > > + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN); > > if (nd->flags & LOOKUP_RCU) { > > - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK); > > + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp); > > if (err != -ECHILD || !try_to_unlazy(nd)) > > return err; > > } > > - return inode_permission(idmap, nd->inode, MAY_EXEC); > > + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp); > > } > > > > static int reserve_stack(struct nameidata *nd, struct path *link) > > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags, > > if (IS_ERR(name)) > > return PTR_ERR(name); > > set_nameidata(&nd, dfd, name, root); > > + if ((flags & LOOKUP_MOUNTPOINT) && may_mount()) > > + nd.state = ND_SYS_ADMIN; > > retval = path_lookupat(&nd, flags | LOOKUP_RCU, path); > > if (unlikely(retval == -ECHILD)) > > retval = path_lookupat(&nd, flags, path); > > -- > > 2.40.0 > > > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-17 21:26 ` NeilBrown @ 2023-04-20 21:35 ` Al Viro 2023-04-20 22:01 ` NeilBrown 0 siblings, 1 reply; 46+ messages in thread From: Al Viro @ 2023-04-20 21:35 UTC (permalink / raw) To: NeilBrown Cc: Christian Brauner, Jeff Layton, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Tue, Apr 18, 2023 at 07:26:34AM +1000, NeilBrown wrote: > MNT_FORCE is, I think, a good idea and a needed functionality that has > never been implemented well. > MNT_FORCE causes nfs_umount_begin to be called as you noted, which > aborts all pending RPCs on that filesystem. Suppose it happens to be mounted in another namespace as well. Or bound at different mountpoint, for that matter. What should MNT_FORCE do? ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-20 21:35 ` Al Viro @ 2023-04-20 22:01 ` NeilBrown 2023-04-20 22:27 ` Al Viro 0 siblings, 1 reply; 46+ messages in thread From: NeilBrown @ 2023-04-20 22:01 UTC (permalink / raw) To: Al Viro Cc: Christian Brauner, Jeff Layton, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, 21 Apr 2023, Al Viro wrote: > On Tue, Apr 18, 2023 at 07:26:34AM +1000, NeilBrown wrote: > > > MNT_FORCE is, I think, a good idea and a needed functionality that has > > never been implemented well. > > MNT_FORCE causes nfs_umount_begin to be called as you noted, which > > aborts all pending RPCs on that filesystem. > > Suppose it happens to be mounted in another namespace as well. Or bound > at different mountpoint, for that matter. What should MNT_FORCE do? > 1/ set a "forced-unmount" flag on the vfs_mount which causes any syscall that uses the vfsmount (whether from an fd, or found in a path walk, or elsewhere), except for close(), to abort with an error; 2/ call ->umount_begin passing in the vfs_mount. The fs can abort any outstanding transaction on any fd from that vfs_mount. Possibly it might instead abort a wait rather than the whole transaction, particularly if requests using some other vfs_mount might also be interested in the transaction 3/ ->close() on a force-unmount vfs_mount would clean up without blocking indefinitely, discarding dirty data if necessary. This still depends on the application to close all fds that return errors (and to chdir out of a problematic directory). But at least it *allows* applications to do that without requiring that they be killed. Thanks, NeilBrown ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-20 22:01 ` NeilBrown @ 2023-04-20 22:27 ` Al Viro 0 siblings, 0 replies; 46+ messages in thread From: Al Viro @ 2023-04-20 22:27 UTC (permalink / raw) To: NeilBrown Cc: Christian Brauner, Jeff Layton, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Fri, Apr 21, 2023 at 08:01:09AM +1000, NeilBrown wrote: > On Fri, 21 Apr 2023, Al Viro wrote: > > On Tue, Apr 18, 2023 at 07:26:34AM +1000, NeilBrown wrote: > > > > > MNT_FORCE is, I think, a good idea and a needed functionality that has > > > never been implemented well. > > > MNT_FORCE causes nfs_umount_begin to be called as you noted, which > > > aborts all pending RPCs on that filesystem. > > > > Suppose it happens to be mounted in another namespace as well. Or bound > > at different mountpoint, for that matter. What should MNT_FORCE do? > > > > 1/ set a "forced-unmount" flag on the vfs_mount which causes any syscall > that uses the vfsmount (whether from an fd, or found in a path walk, > or elsewhere), except for close(), to abort with an error; > 2/ call ->umount_begin passing in the vfs_mount. The fs can abort any > outstanding transaction on any fd from that vfs_mount. How the hell would you recognize those? ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible. 2023-04-16 23:13 ` [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible NeilBrown 2023-04-17 11:55 ` Christian Brauner @ 2023-04-17 12:09 ` Jeff Layton 1 sibling, 0 replies; 46+ messages in thread From: Jeff Layton @ 2023-04-17 12:09 UTC (permalink / raw) To: NeilBrown Cc: Al Viro, Christian Brauner, Dave Wysochanski, linux-fsdevel, linux-nfs, David Howells, Christoph Hellwig On Mon, 2023-04-17 at 09:13 +1000, NeilBrown wrote: > When performing a LOOKUP_MOUNTPOINT lookup we don't really want to > engage with underlying systems at all. Any mount point MUST be in the > dcache with a complete direct path from the root to the mountpoint. > That should be sufficient to find the mountpoint given a path name. > > This becomes an issue when the filesystem changes unexpected, such as > when a NFS server is changed to reject all access. It then becomes > impossible to unmount anything mounted on the filesystem which has > changed. We could simply lazy-unmount the changed filesystem and that > will often be sufficient. However if the target filesystem needs > "umount -f" to complete the unmount properly, then the lazy unmount will > leave it incompletely unmounted. When "-f" is needed, we really need to > be able to name the target filesyste > > We NEVER want to revalidate anything. We already avoid the revalidation > of the mountpoint itself, be we won't need to revalidate anything on the > path either as thay might affect the cache, and the cache is what we are > really looking in. > > Permission checks are a little less clear. We currently allow any user > to make the umount syscall and perform the path lookup and only reject > unprivileged users once the target mount point has been found. If we > completely relax permission checks then an unprivileged user could probe > inaccessible parts of the name space by examining the error returned > from umount(). > That sounds pretty reasonable. Most umounts are done by root in some fashion anyway. > So we only relax permission check when the user has CAP_SYS_ADMIN > (may_mount() succeeds). > > Note that if the path given is not direct and for example uses symlinks > or "..", then dentries or symlink content may not be cached and a remote > server could cause problem. We can only be certain of a safe unmount if > a direct path is used. > I guess we do still have to allow it to do a lookup due to symlinks. I think this is still worthwhile though since it'd fix a lot of these cases. > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/namei.c | 46 ++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 40 insertions(+), 6 deletions(-) > > diff --git a/fs/namei.c b/fs/namei.c > index edfedfbccaef..f2df1adae7c5 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -498,8 +498,8 @@ static int sb_permission(struct super_block *sb, struct inode *inode, int mask) > * > * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > */ > -int inode_permission(struct mnt_idmap *idmap, > - struct inode *inode, int mask) > +int inode_permission_mp(struct mnt_idmap *idmap, > + struct inode *inode, int mask, bool mp) > { > int retval; > > @@ -523,7 +523,14 @@ int inode_permission(struct mnt_idmap *idmap, > return -EACCES; > } > > - retval = do_inode_permission(idmap, inode, mask); > + if (mp) > + /* We are looking for a mountpoint and so don't > + * want to interact with the filesystem at all, just > + * the dcache and icache. > + */ > + retval = generic_permission(idmap, inode, mask); > + else > + retval = do_inode_permission(idmap, inode, mask); > if (retval) > return retval; > > @@ -533,6 +540,24 @@ int inode_permission(struct mnt_idmap *idmap, > > return security_inode_permission(inode, mask); > } > + > +/** > + * inode_permission - Check for access rights to a given inode > + * @idmap: idmap of the mount the inode was found from > + * @inode: Inode to check permission on > + * @mask: Right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC) > + * > + * Check for read/write/execute permissions on an inode. We use fs[ug]id for > + * this, letting us set arbitrary permissions for filesystem access without > + * changing the "normal" UIDs which are used for other things. > + * > + * When checking for MAY_APPEND, MAY_WRITE must also be set in @mask. > + */ > +int inode_permission(struct mnt_idmap *idmap, > + struct inode *inode, int mask) > +{ > + return inode_permission_mp(idmap, inode, mask, false); > +} > EXPORT_SYMBOL(inode_permission); > > /** > @@ -589,6 +614,7 @@ struct nameidata { > #define ND_ROOT_PRESET 1 > #define ND_ROOT_GRABBED 2 > #define ND_JUMPED 4 > +#define ND_SYS_ADMIN 8 > > static void __set_nameidata(struct nameidata *p, int dfd, struct filename *name) > { > @@ -853,7 +879,8 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry) > > static inline int d_revalidate(struct dentry *dentry, unsigned int flags) > { > - if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) > + if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE) && > + likely(!(flags & LOOKUP_MOUNTPOINT))) > return dentry->d_op->d_revalidate(dentry, flags); > else > return 1; > @@ -1708,12 +1735,17 @@ static struct dentry *lookup_slow(const struct qstr *name, > static inline int may_lookup(struct mnt_idmap *idmap, > struct nameidata *nd) > { > + /* If we are looking for a mountpoint and we have the SYS_ADMIN > + * capability, then we will by-pass the filesys for permission checks > + * and just use generic_permission(). > + */ > + bool mp = (nd->flags & LOOKUP_MOUNTPOINT) && (nd->state & ND_SYS_ADMIN); > if (nd->flags & LOOKUP_RCU) { > - int err = inode_permission(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK); > + int err = inode_permission_mp(idmap, nd->inode, MAY_EXEC|MAY_NOT_BLOCK, mp); > if (err != -ECHILD || !try_to_unlazy(nd)) > return err; > } > - return inode_permission(idmap, nd->inode, MAY_EXEC); > + return inode_permission_mp(idmap, nd->inode, MAY_EXEC, mp); > } > > static int reserve_stack(struct nameidata *nd, struct path *link) > @@ -2501,6 +2533,8 @@ int filename_lookup(int dfd, struct filename *name, unsigned flags, > if (IS_ERR(name)) > return PTR_ERR(name); > set_nameidata(&nd, dfd, name, root); > + if ((flags & LOOKUP_MOUNTPOINT) && may_mount()) > + nd.state = ND_SYS_ADMIN; > retval = path_lookupat(&nd, flags | LOOKUP_RCU, path); > if (unlikely(retval == -ECHILD)) > retval = path_lookupat(&nd, flags, path); This behavior looks right along the lines of what I was thinking. Just for bisectability, it might be worthwhile to break this up along conceptual lines: one patch to make it skip d_revalidate, one that changes the permission checking, etc. I'll plan to give this a try soon with Dave's reproducer. Thanks! -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 46+ messages in thread
end of thread, other threads:[~2023-04-20 22:27 UTC | newest] Thread overview: 46+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-04-13 22:00 allowing for a completely cached umount(2) pathwalk Jeff Layton 2023-04-13 22:25 ` Andreas Dilger 2023-04-13 22:41 ` NeilBrown 2023-04-14 2:43 ` Al Viro 2023-04-14 3:28 ` Trond Myklebust 2023-04-14 3:51 ` Al Viro 2023-04-14 4:06 ` Trond Myklebust 2023-04-14 4:21 ` Al Viro 2023-04-14 9:41 ` Christian Brauner 2023-04-14 10:09 ` Jeff Layton 2023-04-14 11:16 ` Christian Brauner 2023-04-14 12:33 ` Jeff Layton 2023-04-14 12:51 ` Christian Brauner 2023-04-15 9:51 ` Amir Goldstein 2023-04-14 10:06 ` Jeff Layton 2023-04-14 13:41 ` Christian Brauner 2023-04-14 14:21 ` Trond Myklebust 2023-04-14 15:13 ` Christian Brauner 2023-04-14 15:30 ` Trond Myklebust 2023-04-14 15:57 ` Trond Myklebust 2023-04-14 16:22 ` Al Viro 2023-04-14 16:41 ` Trond Myklebust 2023-04-14 19:01 ` Benjamin Coddington 2023-04-17 8:22 ` Christian Brauner 2023-04-14 16:32 ` Christian Brauner 2023-04-14 2:32 ` Al Viro 2023-04-14 10:01 ` Jeff Layton 2023-04-14 12:18 ` Christian Brauner 2023-04-14 14:57 ` Al Viro 2023-04-14 13:16 ` David Wysochanski 2023-04-16 23:13 ` [PATCH/RFC] VFS: LOOKUP_MOUNTPOINT should used cached info whenever possible NeilBrown 2023-04-17 11:55 ` Christian Brauner 2023-04-17 12:25 ` Jeff Layton 2023-04-17 14:24 ` Christian Brauner 2023-04-17 15:21 ` Jeff Layton 2023-04-17 21:34 ` NeilBrown 2023-04-18 8:10 ` Christian Brauner 2023-04-18 3:25 ` Andreas Dilger 2023-04-18 8:04 ` Christian Brauner 2023-04-20 13:05 ` Jeff Layton 2023-04-20 15:41 ` Christian Brauner 2023-04-17 21:26 ` NeilBrown 2023-04-20 21:35 ` Al Viro 2023-04-20 22:01 ` NeilBrown 2023-04-20 22:27 ` Al Viro 2023-04-17 12:09 ` Jeff Layton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox