* [REVIEW][PATCH 0/3] userns fixes for v3.13-rc1 [not found] ` <20131126181043.GA25492-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> @ 2013-11-27 0:14 ` Eric W. Biederman [not found] ` <87siui1z1g.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 41+ messages in thread From: Eric W. Biederman @ 2013-11-27 0:14 UTC (permalink / raw) To: Serge E. Hallyn Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA If folks will take a look I would appreciate it. This is my pile of user namespace fixes against v3.13-rc1. All of these are minor regressions that should be backported to v3.12. Unfortunately they were discovered late enough and I was running sick that I haven't had a time to get to them earlier. Eric W. Biederman (3): vfs: In d_path don't call d_dname on a mount point fork: Allow CLONE_PARENT after setns(CLONE_NEWPID) vfs: Fix a regression in mounting proc fs/dcache.c | 7 ++++++- fs/namespace.c | 2 +- kernel/fork.c | 2 +- 3 files changed, 8 insertions(+), 3 deletions(-) ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <87siui1z1g.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point [not found] ` <87siui1z1g.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-11-27 0:16 ` Eric W. Biederman 2013-11-27 1:58 ` Serge E. Hallyn [not found] ` <8738mi1yya.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-27 0:16 ` [REVIEW][PATCH 2/3] fork: Allow CLONE_PARENT after setns(CLONE_NEWPID) Eric W. Biederman 2013-11-27 0:17 ` [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc Eric W. Biederman 2 siblings, 2 replies; 41+ messages in thread From: Eric W. Biederman @ 2013-11-27 0:16 UTC (permalink / raw) To: Serge E. Hallyn Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org) wrote: > Commit bf056bfa80596a5d14b26b17276a56a0dcb080e5: > "proc: Fix the namespace inode permission checks." converted > the namespace files into symlinks. The same commit changed > the way namespace bind mounts appear in /proc/mounts: > $ mount --bind /proc/self/ns/ipc /mnt/ipc > Originally: > $ cat /proc/mounts | grep ipc > proc /mnt/ipc proc rw,nosuid,nodev,noexec 0 0 > > After commit bf056bfa80596a5d14b26b17276a56a0dcb080e5: > $ cat /proc/mounts | grep ipc > proc ipc:[4026531839] proc rw,nosuid,nodev,noexec 0 0 > > This breaks userspace which expects the 2nd field in > /proc/mounts to be a valid path. The symlink /proc/<pid>/ns/{ipc,mnt,net,pid,user,uts} point to dentries allocated with d_alloc_pseudo that we can mount, and that have interesting names printed out with d_dname. When these files are bind mounted /proc/mounts is not currently displaying the mount point correctly because d_dname is called instead of just displaying the path where the file is mounted. Solve this by adding an explicit check to distinguish mounted pseudo inodes and unmounted pseudo inodes. Unmounted pseudo inodes always use mount of their filesstem as the mnt_root in their path making these two cases easy to distinguish. CC: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Reported-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/dcache.c | 7 ++++++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 4bdb300b16e2..f7282ebf1a37 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -3061,8 +3061,13 @@ char *d_path(const struct path *path, char *buf, int buflen) * thus don't need to be hashed. They also don't need a name until a * user wants to identify the object in /proc/pid/fd/. The little hack * below allows us to generate a name for these objects on demand: + * + * Some pseudo inodes are mountable. When they are mounted + * path->dentry == path->mnt->mnt_root. In that case don't call d_dname + * and instead have d_path return the mounted path. */ - if (path->dentry->d_op && path->dentry->d_op->d_dname) + if (path->dentry->d_op && path->dentry->d_op->d_dname && + (!IS_ROOT(path->dentry) || path->dentry != path->mnt->mnt_root)) return path->dentry->d_op->d_dname(path->dentry, buf, buflen); rcu_read_lock(); -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point 2013-11-27 0:16 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Eric W. Biederman @ 2013-11-27 1:58 ` Serge E. Hallyn [not found] ` <8738mi1yya.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 1 sibling, 0 replies; 41+ messages in thread From: Serge E. Hallyn @ 2013-11-27 1:58 UTC (permalink / raw) To: Eric W. Biederman Cc: Serge E. Hallyn, Gao feng, Containers, linux-fsdevel, Aditya Kali, Oleg Nesterov, Andy Lutomirski Quoting Eric W. Biederman (ebiederm@xmission.com): > > Aditya Kali (adityakali@google.com) wrote: > > Commit bf056bfa80596a5d14b26b17276a56a0dcb080e5: > > "proc: Fix the namespace inode permission checks." converted > > the namespace files into symlinks. The same commit changed > > the way namespace bind mounts appear in /proc/mounts: > > $ mount --bind /proc/self/ns/ipc /mnt/ipc > > Originally: > > $ cat /proc/mounts | grep ipc > > proc /mnt/ipc proc rw,nosuid,nodev,noexec 0 0 > > > > After commit bf056bfa80596a5d14b26b17276a56a0dcb080e5: > > $ cat /proc/mounts | grep ipc > > proc ipc:[4026531839] proc rw,nosuid,nodev,noexec 0 0 > > > > This breaks userspace which expects the 2nd field in > > /proc/mounts to be a valid path. > > The symlink /proc/<pid>/ns/{ipc,mnt,net,pid,user,uts} point to > dentries allocated with d_alloc_pseudo that we can mount, and > that have interesting names printed out with d_dname. > > When these files are bind mounted /proc/mounts is not currently > displaying the mount point correctly because d_dname is called instead > of just displaying the path where the file is mounted. > > Solve this by adding an explicit check to distinguish mounted pseudo > inodes and unmounted pseudo inodes. Unmounted pseudo inodes always > use mount of their filesstem as the mnt_root in their path making > these two cases easy to distinguish. > > CC: stable@vger.kernel.org > Reported-by: Aditya Kali <adityakali@google.com> > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> > --- > fs/dcache.c | 7 ++++++- > 1 files changed, 6 insertions(+), 1 deletions(-) > > diff --git a/fs/dcache.c b/fs/dcache.c > index 4bdb300b16e2..f7282ebf1a37 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -3061,8 +3061,13 @@ char *d_path(const struct path *path, char *buf, int buflen) > * thus don't need to be hashed. They also don't need a name until a > * user wants to identify the object in /proc/pid/fd/. The little hack > * below allows us to generate a name for these objects on demand: > + * > + * Some pseudo inodes are mountable. When they are mounted > + * path->dentry == path->mnt->mnt_root. In that case don't call d_dname > + * and instead have d_path return the mounted path. > */ > - if (path->dentry->d_op && path->dentry->d_op->d_dname) > + if (path->dentry->d_op && path->dentry->d_op->d_dname && > + (!IS_ROOT(path->dentry) || path->dentry != path->mnt->mnt_root)) > return path->dentry->d_op->d_dname(path->dentry, buf, buflen); > > rcu_read_lock(); > -- > 1.7.5.4 ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <8738mi1yya.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point [not found] ` <8738mi1yya.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-11-30 6:15 ` Al Viro [not found] ` <20131130061525.GY10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 0 siblings, 1 reply; 41+ messages in thread From: Al Viro @ 2013-11-30 6:15 UTC (permalink / raw) To: Eric W. Biederman Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds [Eric Dumazet Cc'd since ->d_dname() had been introduced by him] On Tue, Nov 26, 2013 at 04:16:13PM -0800, Eric W. Biederman wrote: > > proc ipc:[4026531839] proc rw,nosuid,nodev,noexec 0 0 > > > > This breaks userspace which expects the 2nd field in > > /proc/mounts to be a valid path. > > The symlink /proc/<pid>/ns/{ipc,mnt,net,pid,user,uts} point to > dentries allocated with d_alloc_pseudo that we can mount, and > that have interesting names printed out with d_dname. > > When these files are bind mounted /proc/mounts is not currently > displaying the mount point correctly because d_dname is called instead > of just displaying the path where the file is mounted. > > Solve this by adding an explicit check to distinguish mounted pseudo > inodes and unmounted pseudo inodes. Unmounted pseudo inodes always > use mount of their filesstem as the mnt_root in their path making > these two cases easy to distinguish. Excuse me, but this is a wrong way to deal with that. The root of the problem is your "let's give them ->d_dname() *and* let them live on the same procfs vfsmount". The latter, BTW, requires the suckers to access nd->path.mnt, which is a bad thing by itself. Plus the wrong way ->d_dname() is called. It was kinda-sorta defensible back when it had been introduced, but only because all filesystems with ->d_dname() had been MS_NOUSER ones. Kinda-sorta since it introduced an unpleasant difference in semantics of d_path() and the rest of its ilk - see e.g. tomoyo calling ->d_dname() manually. ->d_dname() addresses a real need, no arguments here - we do want to delay name generation for pipes and sockets until asked for it. The problem is that it's dealt with in the wrong place - we ought to have it handled when prepend_path() reaches a vfsmount with no parent and an IS_ROOT() dentry. _Then_ we need to check if it has ->d_op->d_dname() and call one if it does. There's a bunch of unpleasant details around the handling of "what if the final vfsmount is detached or internal" (and induced weirdness with d_absolute_path(), etc. callers deciding whether they want to call ->d_dname() manually, since only d_path() calls it); I'll look into that. As for using this procfs vfsmount... Frankly, I would rather add a mini-fs a-la aio one and have those inodes live _there_, with the same vfsmount used for all those guys. Internal, of course. Without bothering with register_filesystem(), with all associated headache ("what if userland mounts the entire thing", etc.) ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <20131130061525.GY10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>]
* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point [not found] ` <20131130061525.GY10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> @ 2013-11-30 17:02 ` Al Viro [not found] ` <20131130170226.GZ10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 0 siblings, 1 reply; 41+ messages in thread From: Al Viro @ 2013-11-30 17:02 UTC (permalink / raw) To: Eric W. Biederman Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds On Sat, Nov 30, 2013 at 06:15:26AM +0000, Al Viro wrote: > There's a bunch of unpleasant details around the handling of "what if > the final vfsmount is detached or internal" (and induced weirdness > with d_absolute_path(), etc. callers deciding whether they want to > call ->d_dname() manually, since only d_path() calls it); I'll look into > that. FWIW, the other callers of prepend_path() boil down to /proc/mountinfo handling, apparmour d_namespace_path() (separate handling of MNT_INTERNAL, __d_path() or d_absolute_path() for the rest) and tomoyo_get_absolute_path() (this one directly calls ->d_dname() itself). Note that /proc/mountinfo will spew garbage for your case (binding /proc/<pid>/ns/mnt somewhere); the mountpoint will show correctly, but the relative name won't - it goes through seq_dentry()->dentry_path() (this and apparmour d_namespace_path() being the only codepaths to dentry_path(), BTW). Eric, what behaviour do you want there? While we are at it: lustre contains this gem: static char* ll_d_path(struct dentry *dentry, char *buf, int bufsize) { char *path = NULL; struct path p; p.dentry = dentry; p.mnt = current->fs->root.mnt; path_get(&p); path = d_path(&p, buf, bufsize); path_put(&p); return path; } That should've been dentry_path(), reimplemented here with the usual braino. Think what this hack produces if current is chrooted into fs in question; the "clever" trick fails and instead of intended path from fs root we get path from wherever current's chrooted to. There's a reason why dentry_path() had been introduced... <looks at the callers of that wonder> /* this can be called inside spin lock so use GFP_ATOMIC. */ buf = (char *)__get_free_page(GFP_ATOMIC); if (buf != NULL) { dentry = d_find_alias(page->mapping->host); if (dentry != NULL) path = ll_d_path(dentry, buf, PAGE_SIZE); } ... if (dentry) dput(dentry); Good luck if it ever gets called under spinlock - dput() is not a thing you should call in such place, ditto for path_put() from ll_d_path() itself... What are the callchains leading there? I've tried to track them, but gave up after a while ;-/ I really hope that it's just "called under spinlock" and not "called from softirq" or something like that... Who maintains that thing, anyway? BTW, what happens if svc_export_request() ends up with pathname filling almost all space left, so that qword_add(bpp, blen, pth) right after the call of d_path() in there overwrites the beginning of d_path() output? Neil? And while we are at it, handling of overflow in there looks also looks fishy... ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <20131130170226.GZ10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>]
* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point [not found] ` <20131130170226.GZ10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> @ 2013-11-30 21:51 ` Eric W. Biederman [not found] ` <87a9glh838.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-12-01 5:09 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Al Viro 2013-12-02 5:43 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point NeilBrown 2 siblings, 1 reply; 41+ messages in thread From: Eric W. Biederman @ 2013-11-30 21:51 UTC (permalink / raw) To: Al Viro Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes: > Eric, what behaviour do you want there? Here is the behavior I want. a) Something reasonable in /proc/self/fd when I just open one of these. root@unassigned:~# exec 5< /proc/self/ns/net root@unassigned:~# ls -l /proc/self/fd total 0 lrwx------ 1 root root 64 Nov 30 13:12 0 -> /dev/ttyS0 lrwx------ 1 root root 64 Nov 30 13:12 1 -> /dev/ttyS0 lrwx------ 1 root root 64 Nov 30 13:12 2 -> /dev/ttyS0 lr-x------ 1 root root 64 Nov 30 13:12 3 -> /proc/718/fd lr-x------ 1 root root 64 Nov 30 13:12 5 -> net:[4026531955] root@unassigned:~# ip netns add foo b) Something reasonable in /proc/mounts when I bind mount one of these. root@unassigned:~# ip netns add foo root@unassigned:~# cat /proc/mounts [...] proc /var/run/netns/foo proc rw,nosuid,nodev,noexec,relatime 0 0 The technical path that lead me to changing d_path looks something like this. - I want bind mounts to work. So I need check_mnt to pass. - I can't have these file descriptors show up directly in proc or useful permission checks will be bypassed. Therefore d_unhashed must be true. - I don't want every call of d_path to add (deleted) so I need d_unlinked to be false. Therefore S_ISROOT must be true. - When looking at one of these file descriptors when they are not mounted prepend_path see that IS_ROOT is true and report /proc/ unless I implement d_name. - When looking at one of these file descriptors when they are bind mounted I want /proc/mounts to reflect where they are mounted, which means I shouldn't call d_path. Any dentry allocated with d_alloc_pseudo will have the same problem if it is ever in a context where it can be bind mounted. So this is a general issue with d_name. The other path to where I am at starts at this comment in prepend_path: /* * Filesystems needing to implement special "root names" * should do so with ->d_dname() */ Which is exactly what I did only later to discover that d_path get's it wrong if the dentry is mounted. If I remove ns_dname I get the following: root@unassigned:~# exec 5< /proc/self/ns/net root@unassigned:~# ls -l /proc/self/fd/ total 0 lrwx------ 1 root root 64 Nov 30 13:31 0 -> /dev/ttyS0 lrwx------ 1 root root 64 Nov 30 13:31 1 -> /dev/ttyS0 lrwx------ 1 root root 64 Nov 30 13:31 2 -> /dev/ttyS0 lr-x------ 1 root root 64 Nov 30 13:31 3 -> /proc/708/fd lr-x------ 1 root root 64 Nov 30 13:31 5 -> /proc Which is livable but particularly ugly to look at. As for simple correctness. There are only a handful of implementations of d_dname today and the all set a valid path in the file descriptors returned, in anything that can make it to d_path. Which makes the test for a psuedo dentry in d_path safe. And at a very simply level we never want to call d_dname on an actual mount point. So in the small I will argue what I have done is very much correct. In the large I don't see any other set of implementation choices that does not result in a significant redesign of something. Furthermore I have a regression dating back a couple of kernels that breaks /proc/mounts that should be resolved, and this patch is the cleanest, safest, simplest solution I can see. Eric ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <87a9glh838.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point [not found] ` <87a9glh838.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-11-30 22:43 ` Al Viro [not found] ` <20131130224340.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 2014-01-17 3:29 ` Eric W. Biederman 0 siblings, 2 replies; 41+ messages in thread From: Al Viro @ 2013-11-30 22:43 UTC (permalink / raw) To: Eric W. Biederman Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds On Sat, Nov 30, 2013 at 01:51:07PM -0800, Eric W. Biederman wrote: > Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> writes: > > > Eric, what behaviour do you want there? > > Here is the behavior I want. > > a) Something reasonable in /proc/self/fd when I just open one of these. > > root@unassigned:~# exec 5< /proc/self/ns/net > root@unassigned:~# ls -l /proc/self/fd > total 0 > lrwx------ 1 root root 64 Nov 30 13:12 0 -> /dev/ttyS0 > lrwx------ 1 root root 64 Nov 30 13:12 1 -> /dev/ttyS0 > lrwx------ 1 root root 64 Nov 30 13:12 2 -> /dev/ttyS0 > lr-x------ 1 root root 64 Nov 30 13:12 3 -> /proc/718/fd > lr-x------ 1 root root 64 Nov 30 13:12 5 -> net:[4026531955] > root@unassigned:~# ip netns add foo > > b) Something reasonable in /proc/mounts when I bind mount one of these. > > root@unassigned:~# ip netns add foo > root@unassigned:~# cat /proc/mounts > [...] > proc /var/run/netns/foo proc rw,nosuid,nodev,noexec,relatime 0 0 Sigh... What do you want to see in /proc/self/mountinfo? > Any dentry allocated with d_alloc_pseudo will have the same problem if > it is ever in a context where it can be bind mounted. So this is a > general issue with d_name. Yes, ->d_dname() is called in a wrong place. It should be in prepend_path(), if not deeper. What you are doing is growing a kludge on top of that mistake. Note that your patch does nothing for mountinfo. As for redesign... That's exactly what is needed in that area, instead of letting it fester. Just look at the rats' nest in there - we have a maze of helper functions, all alike, with slightly varying semantics. With prepend_path() having oh-so-wonderful calling conventions (0/-ENAMETOOLONG/1/2 for return value with rather opaque callers). Outside of fs/dcache.c we have the following pile: d_path() - most of the callers, periodically misused (see lustre bug upthread) d_absolute_path(), __d_path(), dentry_path() - very few callers, most in apparmour and tomoyo attempts to implement more or less the same kind of thing. Plus handling of /proc/<pid>/mountinfo. As for check_mnt(old) in do_loopback()... I wonder if we shouldn't just turn that puppy into (check_mnt(old) || (old->mnt.mnt_flags & MNT_INTERNAL)). Note that MS_NOUSER in superblock flags would prevent binding pipefs and all mount_pseudo() users, so we wouldn't change existing behaviour... ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <20131130224340.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>]
* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point [not found] ` <20131130224340.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> @ 2013-12-02 7:29 ` Al Viro 0 siblings, 0 replies; 41+ messages in thread From: Al Viro @ 2013-12-02 7:29 UTC (permalink / raw) To: Eric W. Biederman Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds BTW, this "just use nd->path.mnt" is *not* guaranteed to work - sadly, it's possible to get a _symlink_ bound somewhere. I'm not sure whether we should allow that or not, but as it is that's doable. And with your ->follow_link() that means ending up with (vfsmount,dentry) pair that has dentry completely unrelated to vfsmount. The thing is, nd->path is where we would start the traversal for a normal symlink, i.e. the directory where that thing had been found. Which almost always gets covered by the vfsmount as the symlink itself, with one exception - when that symlink is bound there. _Normally_ namei can cope with that just fine - if that's a normal symlink, everything ends up working and we only have to take a bit of care about link->mnt in fs/namei.c:follow_link(). With procfs-style symlinks it also happens to work. With one exception - /proc/<pid>/ns/* Hell knows; my preference would be to ban that kind of crap, especially since it comes from a rather convoluted loophole... BTW, Linus, remember the thread back in March when I was grumbling about procfs-symlinks-to-symlinks? We ended up killing that BUG_ON() in nd_jump_link() and back then I hadn't found sufficiently convincing breakage to require failing such suckers with -ELOOP. Looks like now I have a reasonably good argument - pathname lookups with LOOKUP_FOLLOW should _not_ yield symlinks. That's the loophole refered to above... If you have happily flushed that memory, the problem was that following a procfs symlink gives us location in the tree, but that's not enough when that location itself happens to be a symlink; to resolve a normal symlink we need both the symlink itself *and* where had it been found. When we'd arrived to that symlink by following a procfs-style symlink, the second part is missing. Now, normally that's not an issue - to hit that case you need O_PATH|O_NOFOLLOW open of a symlink (giving you an O_PATH descriptor of symlink itself) and try to follow /proc/<pid>/fd/<that descriptor>. No other way to get that kind of symlink-to-symlink in procfs. I argued for ELOOP on following those guys; you replied with <quote> Why? That's what I did last week, it seemed to be the RightThing(tm) to do. It somebody has opened a symlink, opening that again through /proc gives you the same symlink. That would be the natural semantics, no? You do *not* want to follow it any further, afaik. </quote> The trouble is, there is code assuming that LOOKUP_FOLLOW pathname resolution will stop at non-symlink or fail. One such example is do_loopback(), aka. mount --bind old new, which uses LOOKUP_FOLLOW to resolve old. However, using the trick above (open a symlink with O_PATH|O_NOFOLLOW, then pass /proc/self/fd/<descriptor> to mount(2) with MS_BIND in flags) you can actually have it yield a symlink. It mostly works (Eric's proc_ns_follow_link() is actually the only instance that has problems with that setup), but it's certainly unexpected. And I'm not at all sure that we have no other place using LOOKUP_FOLLOW and breaking if what it gets is a symlink... I can buy your argument about opening them (O_PATH open of /proc/self/fd/something yields O_PATH descriptor of whatever's opened there, and if it's a symlink, so be it - you've asked for it, you've got it), but I would like to have path_lookupat() with LOOKUP_FOLLOW fail if it's ended up in a symlink. With ELOOP, probably. I agree that nd_jump_link() is not a good place for that; I would add if (!err && nd->flags & LOOKUP_FOLLOW) { if (unlikely(d_is_symlink(nd->path.dentry))) { path_put(&nd->path); err = -ELOOP; } } next to similar check for LOOKUP_DIRECTORY in path_lookupat(). Objections? ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point 2013-11-30 22:43 ` Al Viro [not found] ` <20131130224340.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> @ 2014-01-17 3:29 ` Eric W. Biederman [not found] ` <874n53gub7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 1 sibling, 1 reply; 41+ messages in thread From: Eric W. Biederman @ 2014-01-17 3:29 UTC (permalink / raw) To: Al Viro Cc: Serge E. Hallyn, Gao feng, Containers, linux-fsdevel, Aditya Kali, Oleg Nesterov, Andy Lutomirski, Linus Torvalds, Eric Dumazet, Neil Brown Al Viro <viro@ZenIV.linux.org.uk> writes: > On Sat, Nov 30, 2013 at 01:51:07PM -0800, Eric W. Biederman wrote: >> Al Viro <viro@ZenIV.linux.org.uk> writes: >> >> > Eric, what behaviour do you want there? >> >> Here is the behavior I want. >> >> a) Something reasonable in /proc/self/fd when I just open one of these. >> >> root@unassigned:~# exec 5< /proc/self/ns/net >> root@unassigned:~# ls -l /proc/self/fd >> total 0 >> lrwx------ 1 root root 64 Nov 30 13:12 0 -> /dev/ttyS0 >> lrwx------ 1 root root 64 Nov 30 13:12 1 -> /dev/ttyS0 >> lrwx------ 1 root root 64 Nov 30 13:12 2 -> /dev/ttyS0 >> lr-x------ 1 root root 64 Nov 30 13:12 3 -> /proc/718/fd >> lr-x------ 1 root root 64 Nov 30 13:12 5 -> net:[4026531955] >> root@unassigned:~# ip netns add foo >> >> b) Something reasonable in /proc/mounts when I bind mount one of these. >> >> root@unassigned:~# ip netns add foo >> root@unassigned:~# cat /proc/mounts >> [...] >> proc /var/run/netns/foo proc rw,nosuid,nodev,noexec,relatime 0 0 > > Sigh... What do you want to see in /proc/self/mountinfo? With or without my patch I see what I am after right now is: 31 12 0:3 / /var/run/netns/foo rw,nosuid,nodev,noexec,relatime - proc proc rw Apologies for not being clear on that point. However it would be nice to see: 28 13 0:3 net:[4026532127] /var/run/netns/foo rw,nosuid,nodev,noexec,relatime - proc proc rw Not having the path be the magical floating path is not causing regressions for people so I don't care much at the moment. >> Any dentry allocated with d_alloc_pseudo will have the same problem if >> it is ever in a context where it can be bind mounted. So this is a >> general issue with d_name. > > Yes, ->d_dname() is called in a wrong place. It should be in prepend_path > if not deeper. What you are doing is growing a kludge on top of that > mistake. I disagree. I am not growing a kludge on of a mistake. I am refining the logic at the call site of d_dname. The logic to not call d_dname on a mountpoint should be there wherever we call d_dname. Which makes my change completely orthogonal to moving the call of d_dname into the strangeness that is prepend_path. > Note that your patch does nothing for mountinfo. And it doesn't need to do anything for mountinfo. My desire and the only sensible semantics I can think of is for non-mountpoints to have d_dname called on them. By definition /proc/mountinfo only shows mount points, so there is never a neeed to call d_dpath is /proc/mountinfo. Looking at what is going on in detail, the implementations of the .d_dname method are: arch/ia64/kernel/perfmon.c: pfmfs_dname ia64 perfmon files are unlinked character devices whose dentries are allocated with d_alloc, and the filesystem is mounted with kern_mount. fs/aio.c:simple_dname aio private files that are non-directories whose dentries are allocated with d_alloc_pseudo, and the filesystem is mounted with kern_mount. fs/anon_inodes.c:anon_inodefs_dname anonymous inodes that are non-directories whose dentries are allocated with d_alloc_pseudo, and the filesystem is mounted with kern_mount. fs/pipe.c:pipefs_dname pipes that are non-directories whose dentries are allocated with d_alloc_psuedo, and the fs is mounted with kern_mount. net/socket.c:sockfs_dname sockets are non-directories whose dentries are allocated with d_alloc_pseudo, and the filesystem is mounted with kern_mount. mm/shmem.c:simple_dname shared memory segments are non-directories whose dentries are allocated with d_alloc_pseudo, and the filesystem is mounted with kern_mount. fs/hugetlbfs/inode.c:simple_dname hugetlb shared memory segments are non-directories whose dentries are allocted with d_alloc_pseudo, and the filesystem is mounted with kern_mount_data. fs/proc/namespaces.c:ns_dname namespace files are non-directories whose dentries are allocated with d_alloc_pseudo, and the filesystem is mounted normally. In the worst case prepend_path is safe to use on the dentries that implement d_dname as their dentry name field has valid data. The callers of prepend_path that are not d_path are: __d_path d_absolute_path getcwd getcwd need not be worried about because get_fs_root_and_pwd will always return directories. And none of the implementations of d_dname are directories. __d_path is has only two callers: fs/seq_file.c:seq_path_root and security/apparmor/path.c:d_namespace_path. fs/seq_file.c: seq_path_root which is only used in fs/proc_namespaces.c:show_mountinfo aka /proc/mountinfo. Since this is only dealing with mounted paths there will never be a need to call d_dname. security/apparmor/path.c:d_namespace_path in apparmor is almost interesting, it is called on files that come from exec (brpm->file), link, rename, open, unlink, create, truncate, chown, chmod, and getattr. Which with clever usage of magic proc symlinks could conceivably point to a file descriptor that implements d_dname. However d_namespace_path tests the mountpoint for MNT_INTERNAL and calls dentry_path if that is the case, avoiding __d_path or d_absolute_path, except for the proc namespace files. In either case today if mounted a sensible result and if not mounted an empty path. The only other caller of d_absolute_path is security/tomoyo/realpath.c:tomoyo_get_absolute_path. However tomyo_get_absolute_path is only called from tomoyo_realpath_from_path which checks for dentries that implement d_dname and calls them explicitly, only calling d_absolute_name if there is no such implementation. Which is a long way of saying that right now moving the call of d_dname into prepend_path would make no practical difference. At a practical level there are improvements to be had by calling d_dname in dentry_path and by adding my avoidance of calling d_dname on a mountpoint into tomoyo_realpath_from_path. So while I see what you mean by prepend_path needing cleaning up that is really orthogonal to the acidental regression that I am fixing. > As for redesign... That's exactly what is needed in that area, instead of > letting it fester. I can't backport a redesign to fix the regression, and a redesign results in no practical benefit for me. So I might be persuaded later if I can find the time but right now a redesign is the wrong place to start. > As for check_mnt(old) in do_loopback()... I wonder if we shouldn't just > turn that puppy into (check_mnt(old) || (old->mnt.mnt_flags & MNT_INTERNAL)). > Note that MS_NOUSER in superblock flags would prevent binding pipefs and > all mount_pseudo() users, so we wouldn't change existing behaviour... Strictly speaking behavior would change for the proc namespace files, as now you could do things by finding a mount of proc in another mount namespace where the namespace files were available. I would be a lot more comfortable with changing do_loopback if we could safely remove the check_mnt(old) test altogether. Why do we have the check_mnt(old) test in do_loopback? If we can't remove the check_mnt(old) test I need verify that the reasons for that test don't apply to my namespace file descriptor case. Otherwise I can't see any real problems with making that change (time permitting). After getting this regression fix merged, the windmill that I expect I will be tilting at are fixing the user space visible races that let a dentry be renamed while we are mounting something on it, and the messy fact that check_submounts_and_drop isn't part of d_invalidate. Both of those can potentially result in user space visible insantiy. Eric ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <874n53gub7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point [not found] ` <874n53gub7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2014-01-17 8:39 ` Al Viro [not found] ` <20140117083901.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 0 siblings, 1 reply; 41+ messages in thread From: Al Viro @ 2014-01-17 8:39 UTC (permalink / raw) To: Eric W. Biederman Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds On Thu, Jan 16, 2014 at 07:29:16PM -0800, Eric W. Biederman wrote: > However it would be nice to see: > 28 13 0:3 net:[4026532127] /var/run/netns/foo rw,nosuid,nodev,noexec,relatime - proc proc rw > > Not having the path be the magical floating path is not causing > regressions for people so I don't care much at the moment. > >> Any dentry allocated with d_alloc_pseudo will have the same problem if > >> it is ever in a context where it can be bind mounted. So this is a > >> general issue with d_name. > > > > Yes, ->d_dname() is called in a wrong place. It should be in prepend_path > > if not deeper. What you are doing is growing a kludge on top of that > > mistake. > > I disagree. I am not growing a kludge on of a mistake. I am refining > the logic at the call site of d_dname. The logic to not call d_dname > on a mountpoint should be there wherever we call d_dname. What, in your opinion does the word "kludge" mean? The usual meaning is "a change that might work, but depends on too many easy-to-break accidental details of the current construction". If you want to use a different term, fine, but the problem doesn't disappear from renaming things. I agree that your patch doesn't break things and I've said so from the very beginning. [snip the obvious analysis] > At a practical level there are improvements to be had by calling d_dname > in dentry_path and by adding my avoidance of calling d_dname on a > mountpoint into tomoyo_realpath_from_path. ... i.e. making things even more convoluted. Congratulations, that's just what we need. > So while I see what you mean by prepend_path needing cleaning up that is > really orthogonal to the acidental regression that I am fixing. > > > As for redesign... That's exactly what is needed in that area, instead of > > letting it fester. > > I can't backport a redesign to fix the regression, and a redesign > results in no practical benefit for me. So I might be persuaded later > if I can find the time but right now a redesign is the wrong place to > start. Gotta love the style, but... I'm not persuaded by the above. At all. > > As for check_mnt(old) in do_loopback()... I wonder if we shouldn't just > > turn that puppy into (check_mnt(old) || (old->mnt.mnt_flags & MNT_INTERNAL)). > > Note that MS_NOUSER in superblock flags would prevent binding pipefs and > > all mount_pseudo() users, so we wouldn't change existing behaviour... > > Strictly speaking behavior would change for the proc namespace files, as > now you could do things by finding a mount of proc in another mount > namespace where the namespace files were available. Er... Yes, and? If you can do that, you can bloody well just open them there and then do exact same thing. The visible change would be different - it would be the ability to bind hugetlbfs and shmem anon mappings via /proc/*/map_files/* > I would be a lot more comfortable with changing do_loopback if we could > safely remove the check_mnt(old) test altogether. Why do we have the > check_mnt(old) test in do_loopback? If we can't remove the > check_mnt(old) test I need verify that the reasons for that test don't > apply to my namespace file descriptor case. Otherwise I can't see any > real problems with making that change (time permitting). Sigh... You do realize that right now your proc_ns_follow_link() violates a pretty basic assert: nd->path.mnt->mnt_sb == nd->path.dentry->d_sb, right? IOW, that dentry belongs to the filesystem pointed to be vfsmount. At ->follow_link() time we have two objects to keep track of: where we are in pathname resolution (i.e. which directory had we reached) and what symlink are we resolving. nd->path points to the former; the first argument of ->follow_link() - to the latter. _Usually_ the parent is on the same vfsmount as the symlink, so your shortcut doesn't blow up all the time. However, that is not guaranteed - it *is* possible to bind a symlink. Not conveniently at the moment, but it can be done and there's no fundamental reasons why it should be forbidden. And in that setup proc_ns_follow_link() is in trouble. The reasons for check_mnt() there are, IIRC, more about playing silly buggers with propagation trees from some other namespace. IOW, they do not apply to cloning stuff off the internal vfsmounts. /proc/*/map_files/* is potentially a problem, though - being able to create a binding to something on hugetlbfs might have interesting implications. ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <20140117083901.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>]
* [PATCH 0/4] d_dname cleanups [not found] ` <20140117083901.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> @ 2014-02-07 2:21 ` Eric W. Biederman [not found] ` <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 41+ messages in thread From: Eric W. Biederman @ 2014-02-07 2:21 UTC (permalink / raw) To: Al Viro Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds Al, This is a small pile of cleanups to the callers d_dname. This patset changes all implementors of d_dname to d_alloc_pseudo, simplifies the set of conditions in which we consider calling d_dname, moves the code into prepend_path, and finally adds a call to d_dname into __dentry_path allowing __dentry_path will return sensible for dentries that implement d_dname. The patches are available at: git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git userns-vfs-d_dname-cleanups Or in the patches following this email. Eric W. Biederman (4): perfmon: Use d_alloc_pseudo like all of the d_dname callers. vfs: Simply when d_alloc_dname is called. vfs: Move the call of d_op->d_dname from d_path to prepend_path vfs: Call d_dname from dentry_path arch/ia64/kernel/perfmon.c | 4 ++-- fs/dcache.c | 38 +++++++++++++++++++++++--------------- 2 files changed, 25 insertions(+), 17 deletions(-) ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* [PATCH 1/4] perfmon: Use d_alloc_pseudo like all of the d_dname callers. [not found] ` <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2014-02-07 2:23 ` Eric W. Biederman 2014-02-07 2:23 ` [PATCH 2/4] vfs: Simply when d_alloc_dname is called Eric W. Biederman ` (2 subsequent siblings) 3 siblings, 0 replies; 41+ messages in thread From: Eric W. Biederman @ 2014-02-07 2:23 UTC (permalink / raw) To: Al Viro Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- arch/ia64/kernel/perfmon.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c index cb592773c78b..d1611f223768 100644 --- a/arch/ia64/kernel/perfmon.c +++ b/arch/ia64/kernel/perfmon.c @@ -2202,14 +2202,14 @@ pfm_alloc_file(pfm_context_t *ctx) /* * allocate a new dcache entry */ - path.dentry = d_alloc(pfmfs_mnt->mnt_root, &this); + path.dentry = d_alloc_pseudo(pfmfs_mnt->mnt_sb, &this); if (!path.dentry) { iput(inode); return ERR_PTR(-ENOMEM); } path.mnt = mntget(pfmfs_mnt); - d_add(path.dentry, inode); + d_instantiate(path.dentry, inode); file = alloc_file(&path, FMODE_READ, &pfm_file_ops); if (IS_ERR(file)) { -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 2/4] vfs: Simply when d_alloc_dname is called. [not found] ` <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2014-02-07 2:23 ` [PATCH 1/4] perfmon: Use d_alloc_pseudo like all of the d_dname callers Eric W. Biederman @ 2014-02-07 2:23 ` Eric W. Biederman 2014-02-07 2:24 ` [PATCH 3/4] vfs: Move the call of d_op->d_dname from d_path to prepend_path Eric W. Biederman 2014-02-07 2:24 ` [PATCH 4/4] vfs: Call d_dname from dentry_path Eric W. Biederman 3 siblings, 0 replies; 41+ messages in thread From: Eric W. Biederman @ 2014-02-07 2:23 UTC (permalink / raw) To: Al Viro Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds Now that all implementations of d_dname use d_alloc_psuedo only consider calling d_dname on root dentries, as all dentries returned d_alloc_pseudo by d_alloc_pseudo are root dentries. Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/dcache.c | 16 ++++++---------- 1 files changed, 6 insertions(+), 10 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 265e0ce9769c..c250d97befe4 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -3056,18 +3056,14 @@ char *d_path(const struct path *path, char *buf, int buflen) int error; /* - * We have various synthetic filesystems that never get mounted. On - * these filesystems dentries are never used for lookup purposes, and - * thus don't need to be hashed. They also don't need a name until a - * user wants to identify the object in /proc/pid/fd/. The little hack - * below allows us to generate a name for these objects on demand: - * - * Some pseudo inodes are mountable. When they are mounted - * path->dentry == path->mnt->mnt_root. In that case don't call d_dname - * and instead have d_path return the mounted path. + * We have various synthetic files allocated with + * d_alloc_pseudo that are not available through + * ordinary path lookup and don't need a name until + * a user wants to identify the object in + * /proc/pid/fd/ or similiar. */ if (path->dentry->d_op && path->dentry->d_op->d_dname && - (!IS_ROOT(path->dentry) || path->dentry != path->mnt->mnt_root)) + IS_ROOT(path->dentry) && path->dentry != path->mnt->mnt_root) return path->dentry->d_op->d_dname(path->dentry, buf, buflen); rcu_read_lock(); -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 3/4] vfs: Move the call of d_op->d_dname from d_path to prepend_path [not found] ` <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2014-02-07 2:23 ` [PATCH 1/4] perfmon: Use d_alloc_pseudo like all of the d_dname callers Eric W. Biederman 2014-02-07 2:23 ` [PATCH 2/4] vfs: Simply when d_alloc_dname is called Eric W. Biederman @ 2014-02-07 2:24 ` Eric W. Biederman 2014-02-07 2:24 ` [PATCH 4/4] vfs: Call d_dname from dentry_path Eric W. Biederman 3 siblings, 0 replies; 41+ messages in thread From: Eric W. Biederman @ 2014-02-07 2:24 UTC (permalink / raw) To: Al Viro Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds For clarity move the call of d_dname from d_path into prepend_path. There are only 4 callers of prepend_path and this change does not affect the results returned by any of them. d_absolute_path and __d_path return an error and ignore the output of prepend_path error == 2. getcwd only operates on directories and all of the implementations of d_dname are on files. For d_path the argument of no changes in output is a little trickier. path_with_deleted does not trigger because pseudo dentries are match IS_ROOT so is_unlinked is false. All of current implementations of d_dname are files so they are not currently anyone's d_parent, which means today that either the first iteration of prepend_path will call d_dname or none of the iterations of prepend_path will call d_dname. d_dname is called from the while loop in prepend_path so that if it happens in the future that any of the implementors of d_dname are a directory instead of a file then prepend_path will work properly on them. Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/dcache.c | 32 +++++++++++++++++++++----------- 1 files changed, 21 insertions(+), 11 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index c250d97befe4..c5c7847ff84b 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -2892,6 +2892,27 @@ restart: while (dentry != root->dentry || vfsmnt != root->mnt) { struct dentry * parent; + /* + * We have various synthetic files allocated with + * d_alloc_pseudo that are not available through + * ordinary path lookup and don't need a name until + * a user wants to identify the object in + * /proc/pid/fd/ or similiar. + */ + if (IS_ROOT(dentry) && dentry != vfsmnt->mnt_root && + dentry->d_op && dentry->d_op->d_dname) { + char *buf = bptr - blen; + char *name = dentry->d_op->d_dname(dentry, buf, blen); + if (IS_ERR(name)) { + error = PTR_ERR(name); + break; + } + blen = name - buf; + bptr = name; + error = 2; /* unmounted by definition */ + break; + } + if (dentry == vfsmnt->mnt_root || IS_ROOT(dentry)) { struct mount *parent = ACCESS_ONCE(mnt->mnt_parent); /* Global root? */ @@ -3055,17 +3076,6 @@ char *d_path(const struct path *path, char *buf, int buflen) struct path root; int error; - /* - * We have various synthetic files allocated with - * d_alloc_pseudo that are not available through - * ordinary path lookup and don't need a name until - * a user wants to identify the object in - * /proc/pid/fd/ or similiar. - */ - if (path->dentry->d_op && path->dentry->d_op->d_dname && - IS_ROOT(path->dentry) && path->dentry != path->mnt->mnt_root) - return path->dentry->d_op->d_dname(path->dentry, buf, buflen); - rcu_read_lock(); get_fs_root_rcu(current->fs, &root); error = path_with_deleted(path, &root, &res, &buflen); -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH 4/4] vfs: Call d_dname from dentry_path [not found] ` <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> ` (2 preceding siblings ...) 2014-02-07 2:24 ` [PATCH 3/4] vfs: Move the call of d_op->d_dname from d_path to prepend_path Eric W. Biederman @ 2014-02-07 2:24 ` Eric W. Biederman 3 siblings, 0 replies; 41+ messages in thread From: Eric W. Biederman @ 2014-02-07 2:24 UTC (permalink / raw) To: Al Viro Cc: Aditya Kali, Neil Brown, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds If the root dentry implements d_dname instead of ignoring it in __dentry_path call d_dname and return the result. The motivating example are files under /proc/<pid>/ns/ are bind mounted into other locations. This change allow the name of the dentry that is bind mounted to show up in /proc/<pid>/mountinfo. However in general this is correct behavior any time dentry_path is called on a dentry that implements d_dname. Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/dcache.c | 2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index c5c7847ff84b..a7a925fb3ce7 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -3162,6 +3162,8 @@ restart: done_seqretry(&rename_lock, seq); if (error) goto Elong; + if (dentry->d_op && dentry->d_op->d_dname) + retval = dentry->d_op->d_dname(dentry, buf, len); return retval; Elong: return ERR_PTR(-ENAMETOOLONG); -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point [not found] ` <20131130170226.GZ10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 2013-11-30 21:51 ` Eric W. Biederman @ 2013-12-01 5:09 ` Al Viro 2013-12-01 6:15 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mountpoint Tetsuo Handa 2013-12-02 5:43 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point NeilBrown 2 siblings, 1 reply; 41+ messages in thread From: Al Viro @ 2013-12-01 5:09 UTC (permalink / raw) To: Tetsuo Handa Cc: Aditya Kali, Eric W. Biederman, Neil Brown, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds On Sat, Nov 30, 2013 at 05:02:26PM +0000, Al Viro wrote: > FWIW, the other callers of prepend_path() boil down to /proc/mountinfo > handling, apparmour d_namespace_path() (separate handling of MNT_INTERNAL, > __d_path() or d_absolute_path() for the rest) and tomoyo_get_absolute_path() > (this one directly calls ->d_dname() itself). While we are at it, what's the origin of if (buflen >= 256) checks in tomoyo_get_absolute_path() and tomoyo_get_dentry_path()? The minimal buflen value they can be called with is PAGE_SIZE - 1. And if there is a port with pages two times smaller than VAX ones, they'd better audit the tree for places assuming that PAGE_SIZE is at least 4K... ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mountpoint 2013-12-01 5:09 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Al Viro @ 2013-12-01 6:15 ` Tetsuo Handa 0 siblings, 0 replies; 41+ messages in thread From: Tetsuo Handa @ 2013-12-01 6:15 UTC (permalink / raw) To: viro Cc: serge, gaofeng, containers, linux-fsdevel, adityakali, oleg, luto, torvalds, edumazet, neilb, ebiederm Al Viro wrote: > On Sat, Nov 30, 2013 at 05:02:26PM +0000, Al Viro wrote: > > > FWIW, the other callers of prepend_path() boil down to /proc/mountinfo > > handling, apparmour d_namespace_path() (separate handling of MNT_INTERNAL, > > __d_path() or d_absolute_path() for the rest) and tomoyo_get_absolute_path() > > (this one directly calls ->d_dname() itself). > > While we are at it, what's the origin of if (buflen >= 256) checks in > tomoyo_get_absolute_path() and tomoyo_get_dentry_path()? The minimal > buflen value they can be called with is PAGE_SIZE - 1. This is a sanity check in case the caller by error called tomoyo_get_absolute_path() or tomoyo_get_local_path() with buflen < 2, for they might accesses buffer[buflen - 2]. But "PAGE_SIZE <= buf_len <= (UINT_MAX + 1) / 2" will be true because doing buf_len <<= 1 will not make buf_len == 0 since kmalloc((UINT_MAX + 1) / 2) will return NULL. Therefore, we can ignore/kill "if (buflen >= 256)" check. Regards. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point [not found] ` <20131130170226.GZ10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 2013-11-30 21:51 ` Eric W. Biederman 2013-12-01 5:09 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Al Viro @ 2013-12-02 5:43 ` NeilBrown [not found] ` <20131202164359.4f4f2c94-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2 siblings, 1 reply; 41+ messages in thread From: NeilBrown @ 2013-12-02 5:43 UTC (permalink / raw) To: Al Viro, J.Bruce Fields Cc: Aditya Kali, NFS, Containers, Oleg Nesterov, Andy Lutomirski, Eric Dumazet, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Linus Torvalds, Eric W. Biederman [-- Attachment #1.1: Type: text/plain, Size: 1678 bytes --] On Sat, 30 Nov 2013 17:02:26 +0000 Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote: > BTW, what happens if svc_export_request() ends up with pathname filling > almost all space left, so that qword_add(bpp, blen, pth) right after the > call of d_path() in there overwrites the beginning of d_path() output? > Neil? And while we are at it, handling of overflow in there looks also > looks fishy... In this case the returned *blen will be negative so cache_request() in net/sunrpc/cache.h will return -EAGAIN. cache_read() would then go into a tight loop, trying again and again to create the request, but it will never fit in the buffer. I guess maybe an EINVAL might help there, plus code to skip over impossible requests. Maybe the following. We'd need to double check that no ->cache_request function can fail in a way that is worth retrying, but I doubt they would. Any thoughts Bruce? Thanks, NeilBrown diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c index a72de074172d..a065f827e2a3 100644 --- a/net/sunrpc/cache.c +++ b/net/sunrpc/cache.c @@ -748,7 +748,7 @@ static int cache_request(struct cache_detail *detail, detail->cache_request(detail, crq->item, &bp, &len); if (len < 0) - return -EAGAIN; + return -EINVAL; return PAGE_SIZE - len; } @@ -788,6 +788,11 @@ static ssize_t cache_read(struct file *filp, char __user *buf, size_t count, if (rq->len == 0) { err = cache_request(cd, rq); + if (err == -EINVAL) { + spin_lock(&queue_lock); + list_move(&rp->q.list, &rq->q.list); + spin_unlock(&queue_lock); + } if (err < 0) goto out; rq->len = err; [-- Attachment #1.2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] [-- Attachment #2: Type: text/plain, Size: 205 bytes --] _______________________________________________ Containers mailing list Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org https://lists.linuxfoundation.org/mailman/listinfo/containers ^ permalink raw reply related [flat|nested] 41+ messages in thread
[parent not found: <20131202164359.4f4f2c94-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>]
* Re: [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point [not found] ` <20131202164359.4f4f2c94-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> @ 2013-12-02 16:23 ` J.Bruce Fields 0 siblings, 0 replies; 41+ messages in thread From: J.Bruce Fields @ 2013-12-02 16:23 UTC (permalink / raw) To: NeilBrown Cc: Al Viro, Eric W. Biederman, Serge E. Hallyn, Gao feng, Containers, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Aditya Kali, Oleg Nesterov, Andy Lutomirski, Linus Torvalds, Eric Dumazet, NFS On Mon, Dec 02, 2013 at 04:43:59PM +1100, NeilBrown wrote: > On Sat, 30 Nov 2013 17:02:26 +0000 Al Viro <viro-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote: > > > > BTW, what happens if svc_export_request() ends up with pathname filling > > almost all space left, so that qword_add(bpp, blen, pth) right after the > > call of d_path() in there overwrites the beginning of d_path() output? > > Neil? And while we are at it, handling of overflow in there looks also > > looks fishy... > > In this case the returned *blen will be negative so cache_request() in > net/sunrpc/cache.h will return -EAGAIN. > cache_read() would then go into a tight loop, trying again and again to > create the request, but it will never fit in the buffer. > > I guess maybe an EINVAL might help there, plus code to skip over impossible > requests. > Maybe the following. We'd need to double check that no ->cache_request > function can fail in a way that is worth retrying, but I doubt they would. > > Any thoughts Bruce? Yes, the cache_request failures are all similarly fatal overflows. Uh, not sure if I remember how the data structures here work: so userspace is now going to get an -EINVAL on read, and may just end up retrying the read? Or do we hit the "need to release rq" case in cache_read and just discard the request? --b. > > Thanks, > NeilBrown > > > diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c > index a72de074172d..a065f827e2a3 100644 > --- a/net/sunrpc/cache.c > +++ b/net/sunrpc/cache.c > @@ -748,7 +748,7 @@ static int cache_request(struct cache_detail *detail, > > detail->cache_request(detail, crq->item, &bp, &len); > if (len < 0) > - return -EAGAIN; > + return -EINVAL; > return PAGE_SIZE - len; > } > > @@ -788,6 +788,11 @@ static ssize_t cache_read(struct file *filp, char __user *buf, size_t count, > > if (rq->len == 0) { > err = cache_request(cd, rq); > + if (err == -EINVAL) { > + spin_lock(&queue_lock); > + list_move(&rp->q.list, &rq->q.list); > + spin_unlock(&queue_lock); > + } > if (err < 0) > goto out; > rq->len = err; -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 41+ messages in thread
* [REVIEW][PATCH 2/3] fork: Allow CLONE_PARENT after setns(CLONE_NEWPID) [not found] ` <87siui1z1g.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-27 0:16 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Eric W. Biederman @ 2013-11-27 0:16 ` Eric W. Biederman [not found] ` <87vbzezojq.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-27 0:17 ` [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc Eric W. Biederman 2 siblings, 1 reply; 41+ messages in thread From: Eric W. Biederman @ 2013-11-27 0:16 UTC (permalink / raw) To: Serge E. Hallyn Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes: > Hi Oleg, > > commit 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e : > "fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks" > breaks lxc-attach in 3.12. That code forks a child which does > setns() and then does a clone(CLONE_PARENT). That way the > grandchild can be in the right namespaces (which the child was > not) and be a child of the original task, which is the monitor. > > lxc-attach in 3.11 was working fine with no side effects that I > could see. Is there a real danger in allowing CLONE_PARENT > when current->nsproxy->pidns_for_children is not our pidns, > or was this done out of an "over-abundance of caution"? Can we > safely revert that new extra check? The two fundamental things I know we can not allow are: - A shared signal queue aka CLONE_THREAD. Because we compute the pid and uid of the signal when we place it in the queue. - Changing the pid and by extention pid_namespace of an existing process. >From a parents perspective there is nothing special about the pid namespace, to deny CLONE_PARENT, because the parent simply won't know or care. >From the childs perspective all that is special really are shared signal queues. User mode threading with CLONE_PARENT|CLONE_VM|CLONE_SIGHAND and tasks in different pid namespaces is almost certainly going to break because it is complicated. But shared signal handlers can look at per thread information to know which pid namespace a process is in, so I don't know of any reason not to support CLONE_PARENT|CLONE_VM|CLONE_SIGHAND threads at the kernel level. It would be absolutely stupid to implement but that is a different thing. So hmm. Because it can do no harm, and because it is a regression let's remove the CLONE_PARENT check and send it stable. Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Acked-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Acked-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- kernel/fork.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 728d5be9548c..f82fa2ee7581 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1171,7 +1171,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, * do not allow it to share a thread group or signal handlers or * parent with the forking task. */ - if (clone_flags & (CLONE_SIGHAND | CLONE_PARENT)) { + if (clone_flags & CLONE_SIGHAND) { if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) || (task_active_pid_ns(current) != current->nsproxy->pid_ns_for_children)) -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 41+ messages in thread
[parent not found: <87vbzezojq.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: [REVIEW][PATCH 2/3] fork: Allow CLONE_PARENT after setns(CLONE_NEWPID) [not found] ` <87vbzezojq.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-11-27 1:58 ` Serge E. Hallyn 0 siblings, 0 replies; 41+ messages in thread From: Serge E. Hallyn @ 2013-11-27 1:58 UTC (permalink / raw) To: Eric W. Biederman Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > > Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> writes: > > Hi Oleg, > > > > commit 40a0d32d1eaffe6aac7324ca92604b6b3977eb0e : > > "fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checks" > > breaks lxc-attach in 3.12. That code forks a child which does > > setns() and then does a clone(CLONE_PARENT). That way the > > grandchild can be in the right namespaces (which the child was > > not) and be a child of the original task, which is the monitor. > > > > lxc-attach in 3.11 was working fine with no side effects that I > > could see. Is there a real danger in allowing CLONE_PARENT > > when current->nsproxy->pidns_for_children is not our pidns, > > or was this done out of an "over-abundance of caution"? Can we > > safely revert that new extra check? > > The two fundamental things I know we can not allow are: > - A shared signal queue aka CLONE_THREAD. Because we compute the pid > and uid of the signal when we place it in the queue. > > - Changing the pid and by extention pid_namespace of an existing > process. > > >From a parents perspective there is nothing special about the pid > namespace, to deny CLONE_PARENT, because the parent simply won't know or > care. > > >From the childs perspective all that is special really are shared signal > queues. > > User mode threading with CLONE_PARENT|CLONE_VM|CLONE_SIGHAND and tasks > in different pid namespaces is almost certainly going to break because > it is complicated. But shared signal handlers can look at per thread > information to know which pid namespace a process is in, so I don't know > of any reason not to support CLONE_PARENT|CLONE_VM|CLONE_SIGHAND threads > at the kernel level. It would be absolutely stupid to implement but > that is a different thing. > > So hmm. > > Because it can do no harm, and because it is a regression let's remove > the CLONE_PARENT check and send it stable. > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Acked-by: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > Acked-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> > Acked-by: Serge E. Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> Thanks, Eric. > Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> > --- > kernel/fork.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/kernel/fork.c b/kernel/fork.c > index 728d5be9548c..f82fa2ee7581 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1171,7 +1171,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, > * do not allow it to share a thread group or signal handlers or > * parent with the forking task. > */ > - if (clone_flags & (CLONE_SIGHAND | CLONE_PARENT)) { > + if (clone_flags & CLONE_SIGHAND) { > if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) || > (task_active_pid_ns(current) != > current->nsproxy->pid_ns_for_children)) > -- > 1.7.5.4 ^ permalink raw reply [flat|nested] 41+ messages in thread
* [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <87siui1z1g.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-27 0:16 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Eric W. Biederman 2013-11-27 0:16 ` [REVIEW][PATCH 2/3] fork: Allow CLONE_PARENT after setns(CLONE_NEWPID) Eric W. Biederman @ 2013-11-27 0:17 ` Eric W. Biederman [not found] ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-27 16:13 ` Oleg Nesterov 2 siblings, 2 replies; 41+ messages in thread From: Eric W. Biederman @ 2013-11-27 0:17 UTC (permalink / raw) To: Serge E. Hallyn Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> reported that commit e51db73532955dc5eaba4235e62b74b460709d5b userns: Better restrictions on when proc and sysfs can be mounted caused a regression on mounting a new instance of proc in a mount namespace created with user namespace privileges, when binfmt_misc is mounted on /proc/sys/fs/binfmt_misc. This is an unintended regression caused by the absolutely bogus empty directory check in fs_fully_visible. The check fs_fully_visible replaced didn't even bother to attempt to verify proc was fully visible and hiding proc files with any kind of mount is rare. So for now fix the userspace regression by allowing directory with nlink == 1 as /proc/sys/fs/binfmt_misc has. I will have a better patch but it is not stable material, or last minute kernel material. So it will have to wait. Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> --- fs/namespace.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index ac2ce8a766e1..be32ebccdeb1 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2886,7 +2886,7 @@ bool fs_fully_visible(struct file_system_type *type) struct inode *inode = child->mnt_mountpoint->d_inode; if (!S_ISDIR(inode->i_mode)) goto next; - if (inode->i_nlink != 2) + if (inode->i_nlink > 2) goto next; } visible = true; -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 41+ messages in thread
[parent not found: <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-11-27 0:21 ` Andy Lutomirski [not found] ` <CALCETrVp78EfzY3Oa-LV1Hm8A4Y35apehcxrxdyrzvTb5sp=pA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-11-27 2:00 ` Serge E. Hallyn 2013-11-27 3:19 ` Gao feng 2 siblings, 1 reply; 41+ messages in thread From: Andy Lutomirski @ 2013-11-27 0:21 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Aditya Kali, Containers, Oleg Nesterov, Linux FS Devel On Tue, Nov 26, 2013 at 4:17 PM, Eric W. Biederman <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > > Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> reported that commit > e51db73532955dc5eaba4235e62b74b460709d5b > userns: Better restrictions on when proc and sysfs can be mounted > caused a regression on mounting a new instance of proc in a mount > namespace created with user namespace privileges, when binfmt_misc > is mounted on /proc/sys/fs/binfmt_misc. > > This is an unintended regression caused by the absolutely bogus empty > directory check in fs_fully_visible. The check fs_fully_visible replaced > didn't even bother to attempt to verify proc was fully visible and > hiding proc files with any kind of mount is rare. So for now fix > the userspace regression by allowing directory with nlink == 1 > as /proc/sys/fs/binfmt_misc has. > > I will have a better patch but it is not stable material, or > last minute kernel material. So it will have to wait. Is the better fix to fix procfs to set nlink == 2? --Andy > > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> > --- > fs/namespace.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index ac2ce8a766e1..be32ebccdeb1 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -2886,7 +2886,7 @@ bool fs_fully_visible(struct file_system_type *type) > struct inode *inode = child->mnt_mountpoint->d_inode; > if (!S_ISDIR(inode->i_mode)) > goto next; > - if (inode->i_nlink != 2) > + if (inode->i_nlink > 2) > goto next; > } > visible = true; > -- > 1.7.5.4 > -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <CALCETrVp78EfzY3Oa-LV1Hm8A4Y35apehcxrxdyrzvTb5sp=pA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <CALCETrVp78EfzY3Oa-LV1Hm8A4Y35apehcxrxdyrzvTb5sp=pA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-11-27 0:36 ` Eric W. Biederman 0 siblings, 0 replies; 41+ messages in thread From: Eric W. Biederman @ 2013-11-27 0:36 UTC (permalink / raw) To: Andy Lutomirski; +Cc: Aditya Kali, Containers, Oleg Nesterov, Linux FS Devel Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes: > On Tue, Nov 26, 2013 at 4:17 PM, Eric W. Biederman > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: >> >> Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> reported that commit >> e51db73532955dc5eaba4235e62b74b460709d5b >> userns: Better restrictions on when proc and sysfs can be mounted >> caused a regression on mounting a new instance of proc in a mount >> namespace created with user namespace privileges, when binfmt_misc >> is mounted on /proc/sys/fs/binfmt_misc. >> >> This is an unintended regression caused by the absolutely bogus empty >> directory check in fs_fully_visible. The check fs_fully_visible replaced >> didn't even bother to attempt to verify proc was fully visible and >> hiding proc files with any kind of mount is rare. So for now fix >> the userspace regression by allowing directory with nlink == 1 >> as /proc/sys/fs/binfmt_misc has. >> >> I will have a better patch but it is not stable material, or >> last minute kernel material. So it will have to wait. > > Is the better fix to fix procfs to set nlink == 2? The better fix should be to drop locks, read the directory (f_op->iterate?) and ensure it is empty and then take locks again. nlink is insufficient to check if a directory is empty and a mount is covering a file with something interesting. Only under /proc/sys/... do directories have nlink == 1 so the nlink check continues to provide value for now. The only real world reasonable cases are mounting over an empty directory in /proc or /sys or mounting over the filesystem entirely and the nlink check actually catches the latter because the nlink count is correct on the root directories. Eric ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-27 0:21 ` Andy Lutomirski @ 2013-11-27 2:00 ` Serge E. Hallyn 2013-11-27 3:19 ` Gao feng 2 siblings, 0 replies; 41+ messages in thread From: Serge E. Hallyn @ 2013-11-27 2:00 UTC (permalink / raw) To: Eric W. Biederman Cc: Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > > Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> reported that commit > e51db73532955dc5eaba4235e62b74b460709d5b > userns: Better restrictions on when proc and sysfs can be mounted > caused a regression on mounting a new instance of proc in a mount > namespace created with user namespace privileges, when binfmt_misc > is mounted on /proc/sys/fs/binfmt_misc. > > This is an unintended regression caused by the absolutely bogus empty > directory check in fs_fully_visible. The check fs_fully_visible replaced > didn't even bother to attempt to verify proc was fully visible and > hiding proc files with any kind of mount is rare. So for now fix > the userspace regression by allowing directory with nlink == 1 > as /proc/sys/fs/binfmt_misc has. > > I will have a better patch but it is not stable material, or > last minute kernel material. So it will have to wait. > > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> Thanks, Eric, this should make user namespaces useful again for containers. Acked-by: Serge Hallyn <serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw@public.gmane.org> > --- > fs/namespace.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index ac2ce8a766e1..be32ebccdeb1 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -2886,7 +2886,7 @@ bool fs_fully_visible(struct file_system_type *type) > struct inode *inode = child->mnt_mountpoint->d_inode; > if (!S_ISDIR(inode->i_mode)) > goto next; > - if (inode->i_nlink != 2) > + if (inode->i_nlink > 2) > goto next; > } > visible = true; > -- > 1.7.5.4 ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-27 0:21 ` Andy Lutomirski 2013-11-27 2:00 ` Serge E. Hallyn @ 2013-11-27 3:19 ` Gao feng 2013-11-27 5:00 ` Eric W. Biederman 2 siblings, 1 reply; 41+ messages in thread From: Gao feng @ 2013-11-27 3:19 UTC (permalink / raw) To: Eric W. Biederman, Serge E. Hallyn Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Aditya Kali, Containers, Oleg Nesterov, Andy Lutomirski On 11/27/2013 08:17 AM, Eric W. Biederman wrote: > > Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> reported that commit > e51db73532955dc5eaba4235e62b74b460709d5b > userns: Better restrictions on when proc and sysfs can be mounted > caused a regression on mounting a new instance of proc in a mount > namespace created with user namespace privileges, when binfmt_misc > is mounted on /proc/sys/fs/binfmt_misc. > > This is an unintended regression caused by the absolutely bogus empty > directory check in fs_fully_visible. The check fs_fully_visible replaced > didn't even bother to attempt to verify proc was fully visible and > hiding proc files with any kind of mount is rare. So for now fix > the userspace regression by allowing directory with nlink == 1 > as /proc/sys/fs/binfmt_misc has. > > I will have a better patch but it is not stable material, or > last minute kernel material. So it will have to wait. > > Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > Signed-off-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> > --- > fs/namespace.c | 2 +- > 1 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index ac2ce8a766e1..be32ebccdeb1 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -2886,7 +2886,7 @@ bool fs_fully_visible(struct file_system_type *type) > struct inode *inode = child->mnt_mountpoint->d_inode; > if (!S_ISDIR(inode->i_mode)) > goto next; > - if (inode->i_nlink != 2) > + if (inode->i_nlink > 2) > goto next; > } > visible = true; > As a quick fix. Acked-by: Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> Tested-by: Gao feng <gaofeng-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org> looking forward to your following patch. :) Thanks! ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc 2013-11-27 3:19 ` Gao feng @ 2013-11-27 5:00 ` Eric W. Biederman 0 siblings, 0 replies; 41+ messages in thread From: Eric W. Biederman @ 2013-11-27 5:00 UTC (permalink / raw) To: Gao feng Cc: Serge E. Hallyn, Containers, linux-fsdevel, Aditya Kali, Oleg Nesterov, Andy Lutomirski Gao feng <gaofeng@cn.fujitsu.com> writes: > On 11/27/2013 08:17 AM, Eric W. Biederman wrote: >> >> Gao feng <gaofeng@cn.fujitsu.com> reported that commit >> e51db73532955dc5eaba4235e62b74b460709d5b >> userns: Better restrictions on when proc and sysfs can be mounted >> caused a regression on mounting a new instance of proc in a mount >> namespace created with user namespace privileges, when binfmt_misc >> is mounted on /proc/sys/fs/binfmt_misc. >> >> This is an unintended regression caused by the absolutely bogus empty >> directory check in fs_fully_visible. The check fs_fully_visible replaced >> didn't even bother to attempt to verify proc was fully visible and >> hiding proc files with any kind of mount is rare. So for now fix >> the userspace regression by allowing directory with nlink == 1 >> as /proc/sys/fs/binfmt_misc has. >> >> I will have a better patch but it is not stable material, or >> last minute kernel material. So it will have to wait. >> >> Cc: stable@vger.kernel.org >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> >> --- >> fs/namespace.c | 2 +- >> 1 files changed, 1 insertions(+), 1 deletions(-) >> >> diff --git a/fs/namespace.c b/fs/namespace.c >> index ac2ce8a766e1..be32ebccdeb1 100644 >> --- a/fs/namespace.c >> +++ b/fs/namespace.c >> @@ -2886,7 +2886,7 @@ bool fs_fully_visible(struct file_system_type *type) >> struct inode *inode = child->mnt_mountpoint->d_inode; >> if (!S_ISDIR(inode->i_mode)) >> goto next; >> - if (inode->i_nlink != 2) >> + if (inode->i_nlink > 2) >> goto next; >> } >> visible = true; >> > > As a quick fix. > > Acked-by: Gao feng <gaofeng@cn.fujitsu.com> > Tested-by: Gao feng <gaofeng@cn.fujitsu.com> > > looking forward to your following patch. :) I might have to be prodded. Sometimes it looks easy and sometimes I go ick locking craziness. Once I am done sorting out the regressions I plan on focusing on the mount issues between namespaces. Eric ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc 2013-11-27 0:17 ` [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc Eric W. Biederman [not found] ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-11-27 16:13 ` Oleg Nesterov [not found] ` <20131127161300.GA24773-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 41+ messages in thread From: Oleg Nesterov @ 2013-11-27 16:13 UTC (permalink / raw) To: Eric W. Biederman Cc: Serge E. Hallyn, Gao feng, Containers, linux-fsdevel, Aditya Kali, Andy Lutomirski To all: sorry for noise, I can't comment this patch. But Eric, could you please help me to understand? I am totally confused. So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc") should return false. After the normal "mout -t proc none /proc/" it becomes true. And it is still true after, say, "mount -t ramfs none /proc/sys" because "ls -ld /proc/sys" shows ->i_nlink == 1. However, say, "mount -t ramfs none /proc/tty/" should make fs_fully_visible() == F, because in this case ->i_nlink == 4. Correct? If yes, could you explain what this "!CAP_SYS_ADMIN && !fs_fully_visible" check actually tries to prevent and why? Thanks, Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <20131127161300.GA24773-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <20131127161300.GA24773-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2013-11-27 16:29 ` Serge E. Hallyn [not found] ` <20131127162928.GB7358-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> 2013-11-27 16:41 ` Andy Lutomirski 2013-11-27 18:51 ` Eric W. Biederman 2 siblings, 1 reply; 41+ messages in thread From: Serge E. Hallyn @ 2013-11-27 16:29 UTC (permalink / raw) To: Oleg Nesterov Cc: Aditya Kali, Containers, Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman Quoting Oleg Nesterov (oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org): > To all: sorry for noise, I can't comment this patch. > > > But Eric, could you please help me to understand? I am totally confused. > > So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc") > should return false. > > After the normal "mout -t proc none /proc/" it becomes true. > > And it is still true after, say, "mount -t ramfs none /proc/sys" because > "ls -ld /proc/sys" shows ->i_nlink == 1. > > However, say, "mount -t ramfs none /proc/tty/" should make > fs_fully_visible() == F, because in this case ->i_nlink == 4. > > Correct? > > If yes, could you explain what this "!CAP_SYS_ADMIN && !fs_fully_visible" > check actually tries to prevent and why? The idea is that some admin on a host where /a/b/c/d exists, c/d should be hidden, so overmounts a tmpfs onto /a/b/c. In that case, an unpriv user could clone(CLONE_NEWUSER), then clone(CLONE_NEWNS), then umount /a/b/c and see /a/b/c/d. This patch was to try and prevent that. I argue that it is easy enough for the admin to make /a/b/c permissions such that it can't be crossed by an unprivileged user, but that does require an action on the part of admins who used this (imo misguided) approach. -serge ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <20131127162928.GB7358-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <20131127162928.GB7358-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> @ 2013-11-27 18:09 ` Oleg Nesterov 0 siblings, 0 replies; 41+ messages in thread From: Oleg Nesterov @ 2013-11-27 18:09 UTC (permalink / raw) To: Serge E. Hallyn Cc: Aditya Kali, Containers, Andy Lutomirski, Eric W. Biederman, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On 11/27, Serge E. Hallyn wrote: > > Quoting Oleg Nesterov (oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org): > > To all: sorry for noise, I can't comment this patch. > > > > > > But Eric, could you please help me to understand? I am totally confused. > > > > So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc") > > should return false. > > > > After the normal "mout -t proc none /proc/" it becomes true. > > > > And it is still true after, say, "mount -t ramfs none /proc/sys" because > > "ls -ld /proc/sys" shows ->i_nlink == 1. > > > > However, say, "mount -t ramfs none /proc/tty/" should make > > fs_fully_visible() == F, because in this case ->i_nlink == 4. > > > > Correct? > > > > If yes, could you explain what this "!CAP_SYS_ADMIN && !fs_fully_visible" > > check actually tries to prevent and why? > > The idea is that some admin on a host where /a/b/c/d exists, c/d should > be hidden, so overmounts a tmpfs onto /a/b/c. In that case, an unpriv > user could clone(CLONE_NEWUSER), then clone(CLONE_NEWNS), then umount > /a/b/c and see /a/b/c/d. This patch was to try and prevent that. Thanks Serge, but now I am even more confused... fs_fully_visible() is only called by proc/sysfs_mount ? Perhaps you meant that admin may want to hide something in /proc or /sys? This is what I suspected initially, but see "mount /proc/sys" above... I guess you can ignore me, I don't understand this at all. To the point, suppose that /proc was never mounted, even in the root namespace. Then how a sub-namespace mount it? fs_fully_visible() can't return true. OK, sorry again for noise. Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <20131127161300.GA24773-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2013-11-27 16:29 ` Serge E. Hallyn @ 2013-11-27 16:41 ` Andy Lutomirski [not found] ` <CALCETrXFnw63=JoEaQxM+Opj+kCXSL=9XppymzGKhLzOnp3WaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-11-27 18:51 ` Eric W. Biederman 2 siblings, 1 reply; 41+ messages in thread From: Andy Lutomirski @ 2013-11-27 16:41 UTC (permalink / raw) To: Oleg Nesterov; +Cc: Aditya Kali, Containers, Linux FS Devel, Eric W. Biederman On Wed, Nov 27, 2013 at 8:13 AM, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > To all: sorry for noise, I can't comment this patch. > > > But Eric, could you please help me to understand? I am totally confused. > > So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc") > should return false. > > After the normal "mout -t proc none /proc/" it becomes true. > > And it is still true after, say, "mount -t ramfs none /proc/sys" because > "ls -ld /proc/sys" shows ->i_nlink == 1. Grr. mount -t ramfs none /proc/sys should make proc be not "fully visible". --Andy ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <CALCETrXFnw63=JoEaQxM+Opj+kCXSL=9XppymzGKhLzOnp3WaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <CALCETrXFnw63=JoEaQxM+Opj+kCXSL=9XppymzGKhLzOnp3WaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-11-27 18:10 ` Oleg Nesterov 0 siblings, 0 replies; 41+ messages in thread From: Oleg Nesterov @ 2013-11-27 18:10 UTC (permalink / raw) To: Andy Lutomirski Cc: Aditya Kali, Containers, Linux FS Devel, Eric W. Biederman On 11/27, Andy Lutomirski wrote: > > On Wed, Nov 27, 2013 at 8:13 AM, Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > To all: sorry for noise, I can't comment this patch. > > > > > > But Eric, could you please help me to understand? I am totally confused. > > > > So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc") > > should return false. > > > > After the normal "mout -t proc none /proc/" it becomes true. > > > > And it is still true after, say, "mount -t ramfs none /proc/sys" because > > "ls -ld /proc/sys" shows ->i_nlink == 1. > > Grr. mount -t ramfs none /proc/sys should make proc be not "fully visible". This was one of the reasons for my question ;) Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <20131127161300.GA24773-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2013-11-27 16:29 ` Serge E. Hallyn 2013-11-27 16:41 ` Andy Lutomirski @ 2013-11-27 18:51 ` Eric W. Biederman 2013-11-27 19:47 ` Oleg Nesterov 2 siblings, 1 reply; 41+ messages in thread From: Eric W. Biederman @ 2013-11-27 18:51 UTC (permalink / raw) To: Oleg Nesterov Cc: Aditya Kali, Containers, Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes: > To all: sorry for noise, I can't comment this patch. > > > But Eric, could you please help me to understand? I am totally confused. > > So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc") > should return false. > > After the normal "mout -t proc none /proc/" it becomes true. > > And it is still true after, say, "mount -t ramfs none /proc/sys" because > "ls -ld /proc/sys" shows ->i_nlink == 1. > > However, say, "mount -t ramfs none /proc/tty/" should make > fs_fully_visible() == F, because in this case ->i_nlink == 4. > > Correct? > > If yes, could you explain what this "!CAP_SYS_ADMIN && !fs_fully_visible" > check actually tries to prevent and why? Right now a more accurate name would be fs_visible. The primary goal is to restrict the user namespace root to being able to mount proc and sysfs if nothing new is revealed, preserving the restrictions in jail type environments where proc and sysfs are not mounted at all. I would have dropped the ability for the userns root to create a fresh mount of proc and syfs, and required a bind mount instead except because of the namespace intertwining we need fresh mounts of proc and sys after new namespaces are created. What I really would love to have implemented would be a check that nothing is mounted on top of proc of sysfs but that fails because we have /sys/fs/.. and /proc/sys/fs_binfmt_misc. So I had to implement an empty directory check. And when I first implemented that empty directory check I had a complete and total brain fart. So what is really implemented at the moment would be more accurately called fs_visible(). For the real concern about jail environments where proc and sysfs are not mounted at all a fs_visible check is all that is really required, and if you look at 3.11? I think it was that is what was shipped. Unfortunately I cleaned things up and totally had a massive brain fart and failed to implement the is this an empty directory check properly and somehow failed to test mounting /proc in a real world environment and so caused a significant regression in 3.12 leading to a case where proc is not mountable. This patch just fixes that stupid regression. As the overmount protection in general is not needed as to my knolwedge no one overmounts proc or sysfs in a meaningful way. Now it would be very good to go back and fix fs_fully_visible to either be fs_visible and not care or to read overmounted directories and verify that they are actually empty. Reading directories at this point requires opening a file and probably playing with locks. Something I am not prepared to do for a trivial bug fix. Especially since getting it right only closes theoretical holes at the moment. Does that clarify things? Eric ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc 2013-11-27 18:51 ` Eric W. Biederman @ 2013-11-27 19:47 ` Oleg Nesterov 2013-11-27 19:52 ` Eric W. Biederman 0 siblings, 1 reply; 41+ messages in thread From: Oleg Nesterov @ 2013-11-27 19:47 UTC (permalink / raw) To: Eric W. Biederman Cc: Serge E. Hallyn, Gao feng, Containers, linux-fsdevel, Aditya Kali, Andy Lutomirski Just to avoid the possible confusion, let me repeat that the fix itsef looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong. I am not arguing with this patch, I am just trying to understand this logic. On 11/27, Eric W. Biederman wrote: > > [... snip ...] Thanks a lot. > For the real concern about jail environments where proc and sysfs are > not mounted at all a fs_visible check is all that is really required, this is what I can't understand... Lets ignore the implementation details. Suppose that proc was never mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS? Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc 2013-11-27 19:47 ` Oleg Nesterov @ 2013-11-27 19:52 ` Eric W. Biederman [not found] ` <87iovdmxl7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 41+ messages in thread From: Eric W. Biederman @ 2013-11-27 19:52 UTC (permalink / raw) To: Oleg Nesterov Cc: Serge E. Hallyn, Gao feng, Containers, linux-fsdevel, Aditya Kali, Andy Lutomirski Oleg Nesterov <oleg@redhat.com> writes: > Just to avoid the possible confusion, let me repeat that the fix itsef > looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong. > > I am not arguing with this patch, I am just trying to understand this > logic. > > On 11/27, Eric W. Biederman wrote: >> >> [... snip ...] > > Thanks a lot. > >> For the real concern about jail environments where proc and sysfs are >> not mounted at all a fs_visible check is all that is really required, > > this is what I can't understand... > > Lets ignore the implementation details. Suppose that proc was never > mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS? Yes. Eric ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <87iovdmxl7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <87iovdmxl7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-11-27 20:01 ` Oleg Nesterov 2013-11-27 20:07 ` Eric W. Biederman 1 sibling, 0 replies; 41+ messages in thread From: Oleg Nesterov @ 2013-11-27 20:01 UTC (permalink / raw) To: Eric W. Biederman Cc: Aditya Kali, Containers, Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On 11/27, Eric W. Biederman wrote: > > Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes: > > > Just to avoid the possible confusion, let me repeat that the fix itsef > > looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong. > > > > I am not arguing with this patch, I am just trying to understand this > > logic. > > > > On 11/27, Eric W. Biederman wrote: > >> > >> [... snip ...] > > > > Thanks a lot. > > > >> For the real concern about jail environments where proc and sysfs are > >> not mounted at all a fs_visible check is all that is really required, > > > > this is what I can't understand... > > > > Lets ignore the implementation details. Suppose that proc was never > > mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS? > > Yes. OK, and I agree this makes sense. Just from the code inspection it wasn't clear to me if this was intended or not. Plus nlink("/proc/sys") == 1 added more confusion ;) Thanks to all. Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <87iovdmxl7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-27 20:01 ` Oleg Nesterov @ 2013-11-27 20:07 ` Eric W. Biederman 2013-11-27 20:41 ` Andy Lutomirski [not found] ` <87wqjtlic3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 1 sibling, 2 replies; 41+ messages in thread From: Eric W. Biederman @ 2013-11-27 20:07 UTC (permalink / raw) To: Oleg Nesterov Cc: Aditya Kali, Containers, Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes: > >> Just to avoid the possible confusion, let me repeat that the fix itsef >> looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong. >> >> I am not arguing with this patch, I am just trying to understand this >> logic. >> >> On 11/27, Eric W. Biederman wrote: >>> >>> [... snip ...] >> >> Thanks a lot. >> >>> For the real concern about jail environments where proc and sysfs are >>> not mounted at all a fs_visible check is all that is really required, >> >> this is what I can't understand... >> >> Lets ignore the implementation details. Suppose that proc was never >> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS? > > Yes. Well strictly speaking it should fail after CLONE_NEWUSER | NEWNS | NEWPID. If proc was never mounted. Fresh mounts of proc are not allowed unless you have also created the pid namespace. With just CLONE_NEWUSER | NEWNS you are limited to bind mounts. Has this cleared up the confusion? Eric ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc 2013-11-27 20:07 ` Eric W. Biederman @ 2013-11-27 20:41 ` Andy Lutomirski [not found] ` <CALCETrUwjK7iLMMJaCvKUbBwEqV58oXY4dWzTGJohYgg4DwjWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> [not found] ` <87wqjtlic3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 1 sibling, 1 reply; 41+ messages in thread From: Andy Lutomirski @ 2013-11-27 20:41 UTC (permalink / raw) To: Eric W. Biederman Cc: Oleg Nesterov, Serge E. Hallyn, Gao feng, Containers, Linux FS Devel, Aditya Kali On Wed, Nov 27, 2013 at 12:07 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > ebiederm@xmission.com (Eric W. Biederman) writes: > >> Oleg Nesterov <oleg@redhat.com> writes: >> >>> Just to avoid the possible confusion, let me repeat that the fix itsef >>> looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong. >>> >>> I am not arguing with this patch, I am just trying to understand this >>> logic. >>> >>> On 11/27, Eric W. Biederman wrote: >>>> >>>> [... snip ...] >>> >>> Thanks a lot. >>> >>>> For the real concern about jail environments where proc and sysfs are >>>> not mounted at all a fs_visible check is all that is really required, >>> >>> this is what I can't understand... >>> >>> Lets ignore the implementation details. Suppose that proc was never >>> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS? >> >> Yes. > > Well strictly speaking it should fail after CLONE_NEWUSER | NEWNS | NEWPID. > If proc was never mounted. > > Fresh mounts of proc are not allowed unless you have also created the > pid namespace. With just CLONE_NEWUSER | NEWNS you are limited to bind > mounts. > > Has this cleared up the confusion? > > Eric > This is all obnoxiously complicated. I wonder if we can do (a lot) better by allowing a "pid-only" variant of proc to be mounted. It should contain: - All the pid directories - /proc/self, /proc/net, and /proc/mounts (but possibly not /proc/PID/net -- that's a weird interface IMO and isn't really related to the pid) - keys key-users (wtf is up with that interface, though -- those files are way too magical) - cpuinfo, version, and maybe other informational things (crypto?) - loadavg, perhaps I wonder it would be possible to boot a reasonable container with a heavily limited /proc like that. --Andy ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <CALCETrUwjK7iLMMJaCvKUbBwEqV58oXY4dWzTGJohYgg4DwjWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <CALCETrUwjK7iLMMJaCvKUbBwEqV58oXY4dWzTGJohYgg4DwjWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-11-29 14:56 ` Serge E. Hallyn 0 siblings, 0 replies; 41+ messages in thread From: Serge E. Hallyn @ 2013-11-29 14:56 UTC (permalink / raw) To: Andy Lutomirski Cc: Aditya Kali, Eric W. Biederman, Containers, Oleg Nesterov, Linux FS Devel Quoting Andy Lutomirski (luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org): > On Wed, Nov 27, 2013 at 12:07 PM, Eric W. Biederman > <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote: > > ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > > > >> Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes: > >> > >>> Just to avoid the possible confusion, let me repeat that the fix itsef > >>> looks "obviously fine" to me, "i_nlink != 2" looks obviously wrong. > >>> > >>> I am not arguing with this patch, I am just trying to understand this > >>> logic. > >>> > >>> On 11/27, Eric W. Biederman wrote: > >>>> > >>>> [... snip ...] > >>> > >>> Thanks a lot. > >>> > >>>> For the real concern about jail environments where proc and sysfs are > >>>> not mounted at all a fs_visible check is all that is really required, > >>> > >>> this is what I can't understand... > >>> > >>> Lets ignore the implementation details. Suppose that proc was never > >>> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS? > >> > >> Yes. > > > > Well strictly speaking it should fail after CLONE_NEWUSER | NEWNS | NEWPID. > > If proc was never mounted. > > > > Fresh mounts of proc are not allowed unless you have also created the > > pid namespace. With just CLONE_NEWUSER | NEWNS you are limited to bind > > mounts. > > > > Has this cleared up the confusion? > > > > Eric > > > > This is all obnoxiously complicated. I wonder if we can do (a lot) > better by allowing a "pid-only" variant of proc to be mounted. It > should contain: > > - All the pid directories > - /proc/self, /proc/net, and /proc/mounts (but possibly not > /proc/PID/net -- that's a weird interface IMO and isn't really related > to the pid) > - keys key-users (wtf is up with that interface, though -- those > files are way too magical) > - cpuinfo, version, and maybe other informational things (crypto?) > - loadavg, perhaps > > I wonder it would be possible to boot a reasonable container with a > heavily limited /proc like that. Should be possible. And heck, maybe some of the values could then be virtualized :) cmdline could point to the container init's cmdline; cpuinfo and loadavg and meminfo be filtered through cgroupfs. -serge ^ permalink raw reply [flat|nested] 41+ messages in thread
[parent not found: <87wqjtlic3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc [not found] ` <87wqjtlic3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2013-11-29 19:53 ` Oleg Nesterov 2013-12-13 22:07 ` Richard Weinberger 0 siblings, 1 reply; 41+ messages in thread From: Oleg Nesterov @ 2013-11-29 19:53 UTC (permalink / raw) To: Eric W. Biederman Cc: Aditya Kali, Containers, Andy Lutomirski, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA On 11/27, Eric W. Biederman wrote: > > ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > > > Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> writes: > >> > >> Lets ignore the implementation details. Suppose that proc was never > >> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS? > > > > Yes. > > Well strictly speaking it should fail after CLONE_NEWUSER | NEWNS | NEWPID. Yes, yes, I understand, the mounter should be CAP_SYS_ADMIN in task_active_pid_ns(). > Has this cleared up the confusion? Yes. Thanks. Oleg. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc 2013-11-29 19:53 ` Oleg Nesterov @ 2013-12-13 22:07 ` Richard Weinberger 0 siblings, 0 replies; 41+ messages in thread From: Richard Weinberger @ 2013-12-13 22:07 UTC (permalink / raw) To: Oleg Nesterov Cc: Eric W. Biederman, Serge E. Hallyn, Gao feng, Containers, linux-fsdevel, Aditya Kali, Andy Lutomirski On Fri, Nov 29, 2013 at 8:53 PM, Oleg Nesterov <oleg@redhat.com> wrote: > On 11/27, Eric W. Biederman wrote: >> >> ebiederm@xmission.com (Eric W. Biederman) writes: >> >> > Oleg Nesterov <oleg@redhat.com> writes: >> >> >> >> Lets ignore the implementation details. Suppose that proc was never >> >> mounted. Then "mount -t proc" should fail after CLONE_NEWUSER | NEWNS? >> > >> > Yes. >> >> Well strictly speaking it should fail after CLONE_NEWUSER | NEWNS | NEWPID. > > Yes, yes, I understand, the mounter should be CAP_SYS_ADMIN in > task_active_pid_ns(). > >> Has this cleared up the confusion? > > Yes. Thanks. > > Oleg. What is the state of this work? It would be nice to see this fixed seen. Users on 3.12 suffer already from this regression. -- Thanks, //richard ^ permalink raw reply [flat|nested] 41+ messages in thread
end of thread, other threads:[~2014-02-07 2:24 UTC | newest] Thread overview: 41+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <20131115164123.GN28794@redhat.com> [not found] ` <20131116164840.GA4441@mail.hallyn.com> [not found] ` <20131117030653.GA7670@mail.hallyn.com> [not found] ` <20131118031932.GA17621@mail.hallyn.com> [not found] ` <52899D09.5080202@cn.fujitsu.com> [not found] ` <20131118140830.GA22075@mail.hallyn.com> [not found] ` <20131118180134.GA24156@mail.hallyn.com> [not found] ` <87k3g5gnuv.fsf@xmission.com> [not found] ` <20131126181043.GA25492@mail.hallyn.com> [not found] ` <20131126181043.GA25492-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> 2013-11-27 0:14 ` [REVIEW][PATCH 0/3] userns fixes for v3.13-rc1 Eric W. Biederman [not found] ` <87siui1z1g.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-27 0:16 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Eric W. Biederman 2013-11-27 1:58 ` Serge E. Hallyn [not found] ` <8738mi1yya.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-30 6:15 ` Al Viro [not found] ` <20131130061525.GY10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 2013-11-30 17:02 ` Al Viro [not found] ` <20131130170226.GZ10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 2013-11-30 21:51 ` Eric W. Biederman [not found] ` <87a9glh838.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-30 22:43 ` Al Viro [not found] ` <20131130224340.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 2013-12-02 7:29 ` Al Viro 2014-01-17 3:29 ` Eric W. Biederman [not found] ` <874n53gub7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2014-01-17 8:39 ` Al Viro [not found] ` <20140117083901.GA10323-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org> 2014-02-07 2:21 ` [PATCH 0/4] d_dname cleanups Eric W. Biederman [not found] ` <87iosrhdc0.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2014-02-07 2:23 ` [PATCH 1/4] perfmon: Use d_alloc_pseudo like all of the d_dname callers Eric W. Biederman 2014-02-07 2:23 ` [PATCH 2/4] vfs: Simply when d_alloc_dname is called Eric W. Biederman 2014-02-07 2:24 ` [PATCH 3/4] vfs: Move the call of d_op->d_dname from d_path to prepend_path Eric W. Biederman 2014-02-07 2:24 ` [PATCH 4/4] vfs: Call d_dname from dentry_path Eric W. Biederman 2013-12-01 5:09 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point Al Viro 2013-12-01 6:15 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mountpoint Tetsuo Handa 2013-12-02 5:43 ` [REVIEW][PATCH 1/3] vfs: In d_path don't call d_dname on a mount point NeilBrown [not found] ` <20131202164359.4f4f2c94-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org> 2013-12-02 16:23 ` J.Bruce Fields 2013-11-27 0:16 ` [REVIEW][PATCH 2/3] fork: Allow CLONE_PARENT after setns(CLONE_NEWPID) Eric W. Biederman [not found] ` <87vbzezojq.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-27 1:58 ` Serge E. Hallyn 2013-11-27 0:17 ` [REVIEW][PATCH 3/3] vfs: Fix a regression in mounting proc Eric W. Biederman [not found] ` <87pppmzoin.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-27 0:21 ` Andy Lutomirski [not found] ` <CALCETrVp78EfzY3Oa-LV1Hm8A4Y35apehcxrxdyrzvTb5sp=pA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-11-27 0:36 ` Eric W. Biederman 2013-11-27 2:00 ` Serge E. Hallyn 2013-11-27 3:19 ` Gao feng 2013-11-27 5:00 ` Eric W. Biederman 2013-11-27 16:13 ` Oleg Nesterov [not found] ` <20131127161300.GA24773-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2013-11-27 16:29 ` Serge E. Hallyn [not found] ` <20131127162928.GB7358-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> 2013-11-27 18:09 ` Oleg Nesterov 2013-11-27 16:41 ` Andy Lutomirski [not found] ` <CALCETrXFnw63=JoEaQxM+Opj+kCXSL=9XppymzGKhLzOnp3WaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-11-27 18:10 ` Oleg Nesterov 2013-11-27 18:51 ` Eric W. Biederman 2013-11-27 19:47 ` Oleg Nesterov 2013-11-27 19:52 ` Eric W. Biederman [not found] ` <87iovdmxl7.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-27 20:01 ` Oleg Nesterov 2013-11-27 20:07 ` Eric W. Biederman 2013-11-27 20:41 ` Andy Lutomirski [not found] ` <CALCETrUwjK7iLMMJaCvKUbBwEqV58oXY4dWzTGJohYgg4DwjWQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-11-29 14:56 ` Serge E. Hallyn [not found] ` <87wqjtlic3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2013-11-29 19:53 ` Oleg Nesterov 2013-12-13 22:07 ` Richard Weinberger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).