From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mail-wm0-f65.google.com ([74.125.82.65]:35952 "EHLO
	mail-wm0-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932226AbcEKQnE (ORCPT
	<rfc822;linux-fsdevel@vger.kernel.org>);
	Wed, 11 May 2016 12:43:04 -0400
Date: Wed, 11 May 2016 17:42:47 +0100
From: Djalal Harouni <tixxdz@gmail.com>
To: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>, Chris Mason <clm@fb.com>,
	tytso@mit.edu, Serge Hallyn <serge.hallyn@canonical.com>,
	Josh Triplett <josh@joshtriplett.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Andy Lutomirski <luto@kernel.org>,
	Seth Forshee <seth.forshee@canonical.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-security-module@vger.kernel.org,
	Dongsu Park <dongsu@endocode.com>,
	David Herrmann <dh.herrmann@googlemail.com>,
	Miklos Szeredi <mszeredi@redhat.com>,
	Alban Crequy <alban.crequy@gmail.com>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
Message-ID: <20160511164247.GA9908@dztty.fritz.box>
References: <1462372014-3786-1-git-send-email-tixxdz@gmail.com>
 <1462395979.14310.133.camel@HansenPartnership.com>
 <20160505073636.GA3357@dztty>
 <1462449388.2419.27.camel@HansenPartnership.com>
 <20160505214957.GA3071@dztty>
 <1462486085.2289.23.camel@HansenPartnership.com>
 <1462923416.14896.10.camel@HansenPartnership.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1462923416.14896.10.camel@HansenPartnership.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
> On Thu, 2016-05-05 at 18:08 -0400, James Bottomley wrote:
[...]
> > 
> > OK, so the way attributes are populated on an inode is via getattr. 
> >  You intercept that, you change the inode owner and group that are
> > installed on the inode.  That means that when you list the directory,
> > you see the shift and the shifted uid/gid are used to check 
> > permissions for vfs_open().
> 
> Just to illustrate how this could be done, here's a functional proof of
> concept for a uid/gid shifting bind mount equivalent.  It's not
> actually a proper bind mount because it has to manufacture its own
> inodes.  As you can see, it can only be used by root, it will shift all
> the uid/gid bits as well as the permission comparisons.  It operates on
> subtrees, so it can shift the uids/gids on any filesystem or part of
> one and because the shifts are per superblock, it could actually shift
> the same subtree for multiple users on different shifts.  Best of all,
> it requires no vfs changes at all, being entirely implemented inside
> its own filesystem type.

First, I guess this should have been in a separate thread... this way
this RFC was just hijacked!

Obviously, as you say later in your response, it may require a VFS
change...

You have just consumed all inodes... what about containers or small apps
that are spawned quickly? It could maybe even be used as a DoS... and
maybe you end up reporting different inode numbers?


> You use it just like bind mount:
> 
> mount -t shiftfs <source> <target>
> 
> except that it takes uidshift=x:y:z and gidshift=x:y:z multiple times
> as options.  It's currently not recursive and it definitely needs
> polishing to show things like mount options and be properly Kconfig
> using.

Why is it not recursive? And what if you have circular bind mounts?

Hmm, anyway, you are mounting this on behalf of filesystems, so if you
add the recursive thing you will probably just make everything worse by
making any /proc or /sys dentry under that path shiftable. Unprivileged
users could then just create user namespaces and read /proc/* and all
the other stuff that is not guarded by a capable() check against the
host init_user_ns...

What if you have paths like /filesystem0/uidshiftedY/dir,
/filesystem0/uidshiftedX/dir and /filesystem0/notshifted/dir,
where some of them are also bind mounts that point to the same dentry?


Also, you create a totally new user namespace interface here! By making
your own new interface, don't we just lose the notion of init_user_ns,
its children and their mappings?

I'm not sure of the implications of all this... your user namespace
mapping is not related at all to init_user_ns! It seems that it has
its own init_user_ns? Does a capable() check on a shifted filesystem
now relate to that, and hence to your mapping, or to the real
init_user_ns?


> There's a bit of an open question of whether it should have vfs
> changes: the way the struct file f_inode and f_ops are hijacked is a
> bit nasty and perhaps d_select_inode() could be made a bit cleverer to
> help us here instead.

I'm not sure if this PoC works... but are you sure you didn't introduce
a serious vulnerability here? You use a new mapping and you update the
current_fsuid() creds, which are global to any fs operation, so maybe:
let's operate on any inode, update our current_fsuid()... and access
the rest of the *unshifted filesystems*...!?

The worst thing is that current_fsuid() now no longer follows the
/proc/self/uid_map interface! This is a serious vulnerability and a
mix-up of the current semantics... it's updated, but using other
rules...?
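
To make that concern concrete, here is a small userspace model (not
kernel code, and not taken from the PoC). It assumes the shift table
can be written in the same "inside outside count" extent form that
/proc/<pid>/uid_map uses; the names parse_map(), map_id_down(),
userns_map and mount_shift are all made up for illustration, and the
numeric ranges are arbitrary. It only shows that two independently
configured tables can translate the same id differently:

```python
from typing import List, Optional, Tuple

# One extent: (first id inside, first id outside, number of ids).
Extent = Tuple[int, int, int]


def parse_map(text: str) -> List[Extent]:
    """Parse lines in the uid_map text format: 'inside outside count'."""
    extents = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            extents.append((inside, outside, count))
    return extents


def map_id_down(extents: List[Extent], inside_id: int) -> Optional[int]:
    """Translate an inside id to the outside id; None if it is unmapped."""
    for inside, outside, count in extents:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None


# Mapping as it would appear in /proc/self/uid_map (values are arbitrary).
userns_map = parse_map("0 100000 65536\n")

# Hypothetical per-superblock shift table, configured independently of
# the user namespace (assumed to share the same extent form).
mount_shift = parse_map("0 200000 65536\n")

for uid in (0, 1000):
    print(uid, map_id_down(userns_map, uid), map_id_down(mount_shift, uid))
```

If the fsuid used for permission checks comes from the second table
while everything else still goes through the first one, the two views
of "who am I" no longer agree, which is the mix of semantics I mean.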

For overlayfs I did write an experiment, but for me this is not an
overlayfs or another-new-filesystem problem... we are manipulating
UID/GID identities...

It would have been better if you had sent this as a separate thread.
This was a vfs:userns RFC fix which, if we continue like this, we will
turn into a complicated thing: implementing yet another new light
filesystem with userns... (overlayfs...)

I will follow up if the appropriate thread is created, not here. I
guess that's ok?

> James
> 

Thank you for your feedback!


-- 
Djalal Harouni
http://opendz.org