Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

All of lore.kernel.org
 help / color / mirror / Atom feed

From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: Djalal Harouni <tixxdz@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	Chris Mason <clm@fb.com>,
	tytso@mit.edu, Serge Hallyn <serge.hallyn@canonical.com>,
	Josh Triplett <josh@joshtriplett.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Andy Lutomirski <luto@kernel.org>,
	Seth Forshee <seth.forshee@canonical.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org,
	Dongsu Park <dongsu@endocode.com>,
	David Herrmann <dh.herrmann@googlemail.com>,
	Miklos Szeredi <mszeredi@redhat.com>,
	Alban Crequy <alban.crequy@gmail.com>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
Date: Thu, 12 May 2016 15:24:12 -0700	[thread overview]
Message-ID: <1463091852.2380.72.camel@HansenPartnership.com> (raw)
In-Reply-To: <20160512195552.GB2859@dztty>

On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> > > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> [...]
> > > Hmm anyway you are mounting this on behalf of filesystems, so if
> > > you 
> > > add the recursive thing, you will just probably make everything 
> > > worse, by making any /proc, /sys dentry that's under that path 
> > > shiftable, and unprivileged users can just create user namespaces
> > > and 
> > > read /proc/* and all the other stuff that doesn't have capable() 
> > > related to the init_user_ns host...
> > 
> > That's up to the admin who does the shifting.  Recursive would be
> > an
> > option if added.
> 
> Hmm, not sure if you get my point... you just made it an admin 
> problem where admins want to mount an image downloaded verify it and 
> use it for their container with /proc...! that's another problem!

You can't allow unprivileged containers to shift uids on arbitrary
filesystems, so the admin always has to do something for the initial
setup.

> > >   what if you have paths like /filesystem0/uidshiftedY/dir,
> > > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir 
> > > where some of them are also bind mounts that point to same dentry
> > > ?
> > 
> > Without recursive semantics, you see the underlying inode.  With 
> > them, you see the upper vfsmnts.  Shiftfs isn't idempotent, so you 
> > would need to be careful about nesting.  However, that's an admin
> > problem.
> > 
> > > Also, you create a totally new user namespace interface here! by 
> > > making your own new interface we just lose the notion of 
> > > init_user_ns and its children and mapping ?
> > 
> > I don't quite understand this; the only use of the init_user_ns is 
> > the capable(CAP_SYS_ADMIN) in fill_super which is how only the real
> > admin can mount at a shifted uid/gid.  Otherwise, there's no need 
> > to see into the userns because filesystems see the kuid_t/kgid_t 
> > which is what I'm shifting.
> > 
> > > I'm not sure of the implication of all this... your user 
> > > namespace mapping is not related at all to init_user_ns! it seems 
> > > that it has its own init_user_ns ?   does a capable() check now 
> > > on a shifted filesystem relates to that and hence to your mapping 
> > > or to the real init_user_ns ?
> > 
> > capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)
> > 
> > Or is there a misunderstanding here about how user namespaces work
> > inside the kernel?  The design is that the ID shift is done as you
> > cross the kernel boundary, so a filesystem, being usually all in
> > -kernel operating via the VFS interfaces, ideally never needs to 
> > make any from_kuid/make_kuid calls.  However, there are ways 
> > filesystems can send data across the kernel/user bounary outside of 
> > the usual vfs interfaces (ioctls being the most usual one) so in 
> > that specific code, they have to do the kuid_t to uid_t changes 
> > themselves.  Shiftfs never sends data to the user outside of the 
> > VFS so it never needs to do this and can operate entirely on
> > kuid_ts.
> > 
> > > > There's a bit of an open question of whether it should have vfs
> > > > changes: the way the struct file f_inode and f_ops are hijacked 
> > > > is a bit nasty and perhaps d_select_inode() could be made a bit
> > > > cleverer to help us here instead.
> > > 
> > > I'm not sure if this PoC works... but you sure you didn't 
> > > introduce a serious vulnerability here ? you use a new mapping 
> > > and you update current_fsuid() creds up, which is global on any 
> > > fs operation, so may be: lets operate on any inode, update our 
> > > current_fsuid()... and access the rest of *unshifted filesystems*
> > > ... !?
> > 
> > The credentials are per thread, so it's a standard way of doing
> > credential shifting and no other threads of execution in the same 
> > task get access. As long as you bound the override_creds()/revert_c
> > reds() pairs within the kernel, you're safe.
> 
> No, and here sorry I mean shifted.
> 
> current_fsuid() is global through all fs operations which means it
> crosses user namespaces... it was safe the days of only init_user_ns,
> not anymore... You give a mapping inside containers to fsuid where 
> they don't want to have it... this allows to operate on inodes inside
> other containers... update current_fsuid() even if we want that user 
> to be nobody inside the container... and later it can access the 
> inodes of the shifted fs... and by same current of course...

OK, I still don't understand what you're getting at.  There are three
per-thread uids: uid, euid and fsuid (real, effective and filesystem). 
 They're all either settable via syscall or inherited on fork.  They're
all kernel side, meaning they're kuid_t.  Their values stay invariant
as you move through namespaces.  They change (and get mapped according
to the current user namespace setting) when you call set[fe]uid() So
when I enter a user namespace with mapping

0 100000 1000

and call setuid(0) (which sets all three). they all pick up the kuid_t
of 100000.  This means that writing a file inside the user namespace
after calling setuid(0) appears as real uid 100000 on the medium even
though if I call getuid() from the namespace, I get back 0.  What
shiftfs does is hijack temporarily the kernel fsuid/fsgid for
permission checks, so you can remap to any old uid on the medium
(although usually you'd pass in uidmap=0:100000:1000") it maps back
from kuid_t 100000 to kuid_t 0, which is why the container can now read
and write the underlying medium at on-media id 0 even through root
inside the container has kuid_t 100000.  There's no permanent change of
fsuid and it stays at its invariant value for the thread except as a
temporary measure to do the permission checks on the underlying of the
shifted filesystem.

> > > The worst thing is that current_fsuid() does not follow now the
> > > /proc/self/uid_map interface! this is a serious vulnerability and 
> > > a mix of the current semantics... it's updated but using other
> > > rules...?
> > 
> > current_fsuid() is aready mapped via the userns; it's already a 
> > kuid_t at its final value.  Shifting that is what you want to remap
> > underlying volume uid/gid's.  The uidmap/gidmap inputs to this are 
> > shifts on the final underlying uid/gids.
> 
> => some points:
> Changing setfsuid() its interfaces and rules... or an indrect way to
> break another syscall...

There is no change to setfsuid().

> The userns used for *mapping* is totatly different and not standard..
> . losing "init_user_ns and its decendents userns *semantics*...", a
> yet a totatly unlinked mapping...

There is no user namespace mapping at all.  This is a simple shift,
kernel side, of uids and gids at their kuid_t values.

> Breaking current_uid(),current_euid(),current_fsuid() which are
> mapped but in *different* user namespaces... hence different values
> inside namespaces... you can change your userns mapping but that
> current_fsuid specific one will always be remapped to some other 
> value inside even if you don't want it... It crosses user 
> namespaces...  uid and euid are remapped according to /proc/self/uid_
> map, fsuid is remapped according to this new interface...
> 
> Hard coding the mapping, nested containers/apps may *share* fsuid and
> can't get rid of it even if they change the inside userns mapping to
> disable, split, reduce mapped users or offer better isolation they
> can't... no way to make private inodes inside containers if they 
> share the final fsuid, inside container mapping is ignored...
> ...

OK, I think there's a misunderstanding about how credential overrides
work.  They're not permanent changes to the credentials, they're
temporary ones to get stuff done within the kernel at a temporary
privilege.  You can make credentials permanent if you go through
prepare_creds()/commit_creds(), but for making them temporary you do
prepare_creds()/override_creds() and then revert_creds() once you're
done using them.

If you want to see a current use of this, try fs/open.c:faccessat. 
 What it's doing is temporarily overriding fsuid with the real uid to
check the permissions before reverting the credentials and returning to
the user.

James

> > the privileged ids down to 100000, but I have a volume which still 
> > has realids, I can mount that volume using shiftfs with
> > uidmap=0:100000:1000 and it will allow this userns to read and 
> > write the volume through its remapped ids.
> > 
> > > For overlayfs I did write an expriment but for me it's not an 
> > > overlayfs or another new filesystem problem... we are 
> > > manipulating UID/GID identities...
> > > 
> > > It would have been better if you did send this as a separate 
> > > thread. It was a vfs:userns RFC fix which if we continue we turn 
> > > it into a complicated thing! implement another new light 
> > > filesystem with userns... (overlayfs...)
> > > 
> > > Will follow up if the appropriate thread is created, not here, I 
> > > guess it's ok ?
> > 
> > Well, I can resend the patch as a separate thread when I've fixed 
> > some of the problems viro pointed out.
> > 
> > James
> > 
> > > > James
> > > > 
> > > 
> > > Thank you for your feedback!
> > > 
> > > 
> > 
> 
> Thanks!
>

next prev parent reply	other threads:[~2016-05-12 22:24 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-04 14:26 [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 1/8] VFS: add CLONE_MNTNS_SHIFT_UIDGID flag to allow mounts to shift their UIDs/GIDs Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 2/8] VFS:uidshift: add flags and helpers to shift UIDs and GIDs to virtual view Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid Djalal Harouni
2016-05-04 23:19   ` Serge Hallyn
2016-05-05 13:05     ` Seth Forshee
2016-05-05 22:40       ` Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 4/8] VFS:userns: shift UID/GID to virtual view during permission access Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 5/8] VFS:userns: add helpers to shift UIDs and GIDs into on-disk view Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 6/8] VFS:userns: shift UID/GID to on-disk view before any write to disk Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 7/8] ext4: add support for vfs_shift_uids and vfs_shift_gids mount options Djalal Harouni
2016-05-04 14:26 ` [RFC v2 PATCH 8/8] btrfs: " Djalal Harouni
2016-05-04 16:34 ` [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems Josh Triplett
2016-05-04 21:06 ` James Bottomley
2016-05-05  7:36   ` Djalal Harouni
2016-05-05 11:56     ` James Bottomley
2016-05-05 21:49       ` Djalal Harouni
2016-05-05 22:08         ` James Bottomley
2016-05-10 23:36           ` James Bottomley
2016-05-11  0:38             ` Al Viro
2016-05-11  0:53             ` Al Viro
2016-05-11  3:47               ` James Bottomley
2016-05-11 16:42             ` Djalal Harouni
2016-05-11 18:33               ` James Bottomley
2016-05-12 19:55                 ` Djalal Harouni
2016-05-12 22:24                   ` James Bottomley [this message]
2016-05-14  9:53                     ` Djalal Harouni
2016-05-14 13:46                       ` James Bottomley
2016-05-15  2:21                         ` Eric W. Biederman
2016-05-15 15:04                           ` James Bottomley
2016-05-16 14:12                           ` Seth Forshee
2016-05-16 16:42                             ` Eric W. Biederman
2016-05-16 18:25                               ` Seth Forshee
2016-05-16 19:13                           ` James Bottomley
2016-05-17 22:40                             ` Eric W. Biederman
2016-05-17 11:42                           ` Djalal Harouni
2016-05-17 15:42                         ` Djalal Harouni
2016-05-04 23:30 ` Serge Hallyn
2016-05-06 14:38   ` Djalal Harouni
2016-05-09 16:26     ` Serge Hallyn
2016-05-10 10:33       ` Djalal Harouni
2016-05-05  0:23 ` Dave Chinner
2016-05-05  1:44   ` Andy Lutomirski
2016-05-05  2:25     ` Dave Chinner
2016-05-05  3:29       ` Andy Lutomirski
2016-05-05 22:34     ` Djalal Harouni
2016-05-05 22:24   ` Djalal Harouni
2016-05-06  2:50     ` Dave Chinner
2016-05-12 19:47       ` Djalal Harouni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1463091852.2380.72.camel@HansenPartnership.com \
    --to=james.bottomley@hansenpartnership.com \
    --cc=alban.crequy@gmail.com \
    --cc=clm@fb.com \
    --cc=david@fromorbit.com \
    --cc=dh.herrmann@googlemail.com \
    --cc=dongsu@endocode.com \
    --cc=ebiederm@xmission.com \
    --cc=josh@joshtriplett.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-security-module@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mszeredi@redhat.com \
    --cc=serge.hallyn@canonical.com \
    --cc=seth.forshee@canonical.com \
    --cc=tixxdz@gmail.com \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.