From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753223AbcFFWCg (ORCPT <rfc822;w@1wt.eu>);
	Mon, 6 Jun 2016 18:02:36 -0400
Received: from bedivere.hansenpartnership.com ([66.63.167.143]:60080 "EHLO
	bedivere.hansenpartnership.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1751170AbcFFWCe (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Mon, 6 Jun 2016 18:02:34 -0400
Message-ID: <1465250550.2393.91.camel@HansenPartnership.com>
Subject: Re: [PATCH 0/1] shiftfs: uid/gid shifting filesystem
From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: Djalal Harouni <tixxdz@gmail.com>
Cc: =?UTF-8?Q?Micha=C5=82?= Zegan <webczat_200@poczta.onet.pl>,
        Chris Mason <clm@fb.com>, tytso@mit.edu,
        Serge Hallyn <serge.hallyn@canonical.com>,
        Josh Triplett <josh@joshtriplett.org>,
        "Eric W. Biederman" <ebiederm@xmission.com>,
        Andy Lutomirski <luto@kernel.org>,
        Seth Forshee <seth.forshee@canonical.com>,
        linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
        linux-security-module@vger.kernel.org,
        Dongsu Park <dongsu@endocode.com>,
        David Herrmann <dh.herrmann@googlemail.com>,
        Miklos Szeredi <mszeredi@redhat.com>,
        Alban Crequy <alban.crequy@gmail.com>,
        Al Viro <viro@ZenIV.linux.org.uk>
Date: Mon, 06 Jun 2016 15:02:30 -0700
In-Reply-To: <20160605211154.GA2901@dztty>
References: <1464740984.7732.5.camel@HansenPartnership.com>
	 <2b58696f-492c-3230-0a3c-2f6f9fbff931@poczta.onet.pl>
	 <1464799260.2445.10.camel@HansenPartnership.com>
	 <20160605211154.GA2901@dztty>
Content-Type: multipart/mixed; boundary="=-qvKxEdYdco3xoTy+gOhg"
X-Mailer: Evolution 3.16.5 
Mime-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org


--=-qvKxEdYdco3xoTy+gOhg
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

On Sun, 2016-06-05 at 22:11 +0100, Djalal Harouni wrote:
> On Wed, Jun 01, 2016 at 12:41:00PM -0400, James Bottomley wrote:
> > On Wed, 2016-06-01 at 18:21 +0200, Michał Zegan wrote:
> > > As I sent a reply in a ... wrong way, I do it again. my question
> > > was:
> > > Why isn't it done at the vfs layer when you mount the fs in
> > > different
> > > userns, instead of using a separate filesystem for it?
> > 
> > Well, that is what this patch does:
> > 
> > http://thread.gmane.org/gmane.linux.kernel/2214882
> > 
> > However, the reason it doesn't work for me is that I want to be
> > able to
> > unpack the image into a subdirectory (so I'm not dedicating a whole
> > filesystem for this).  This is primarily for a docker hack IBM is
> > working on to allow each container instance to use a separate
> > uid/gid
> > range, so I need something that behaves much more like a bind
> > mount.
> I thought that you were using a loop device ?

No, for architecture emulation containers I use file roots, so they're
subdirectories of my home directory.  The interesting issues Serge
discovered are on ext4, which I needed a loop device to reproduce (my
home directory is on xfs), if that's where the confusion arises?

Thinking about containers in general, a significant number use
bind-mounted file roots because that's a nice use case that hypervisors
can't match without clusterable filesystems.  However, I do know of some
containers that are block image based, so whatever solution is chosen
has to support both.

>  that's precisely one of the main case that's solved with that 
> solution... mount the portable fs image into a loop device, set the 
> shift which will be only active into that subdirectory...
> 
> 
> > >  I believe it could be useful to be able to mount all filesystems 
> > > in userns with autoshifted uids, although I do not know security
> > > implications for that usage.
> > 
> > As long as you don't need to subdivide the volume, it works nicely.
> >  However, from a security point of view, that entire volume is now
> > effectively freely writeable by anyone who can set up a userns.  If 
> > you follow the shiftfs route, you can break off writeable
> > subdirectories for each namespace shift, but they can't cross over 
> > into writing subdirectories that belong to other user namespaces 
> > (assuming the uids are fully segregated).
> 
> As said in the other email, I'm not really sure about the use case at
> all... but I give you this quick test with:
> https://gist.githubusercontent.com/tixxdz/6b84c2c3bd6cb987c82255602ec
> 70f23/raw/97c9ab76878f9d7415583c00b22ca0e4a948847b/userns_test.c
> 
> $ mkdir shifted-fedora-tree && sudo mount -t shiftfs 
> -ouidmap=0:1000000:65536,gidmap=0:1000000:65536 ~/fedora-tree/
> shifted-fedora-tree

This is basically what I do for my container roots.  However, after
that I tend to set them up with scripts.  I've attached my latest
build-container script at the bottom.  As you can see from the script,
all my build containers are in /home/jejb/containers.

> [tixxdz@fedora-kvm bin]$ sudo ./userns-test -m -U -M "0 1000000 1"
> /bin/bash
> uid=0(root) gid=65534(nfsnobody) groups=65534(nfsnobody)
> context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> [root@fedora-kvm bin]# cat /proc/self/uid_map 
>          0    1000000          1
> [root@fedora-kvm bin]# echo "$(id -u)_not_a_sandboxed_app" >> shifted
> -fedora-tree/etc/fedora-release
> [root@fedora-kvm bin]# exit
> exit
> [tixxdz@fedora-kvm bin]$ sudo ./userns-test -m -U -M "48 1000000 1"
> /bin/bash
> [apache@fedora-kvm bin]$ id
> uid=48(apache) gid=65534(nfsnobody) groups=65534(nfsnobody)
> context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> [apache@fedora-kvm bin]$ echo "$(id -u)_not_a_sandboxed_app" >>
> shifted-fedora-tree/etc/fedora-release
> [apache@fedora-kvm bin]$ exit
> exit
> [tixxdz@fedora-kvm bin]$ sudo ./userns-test -m -U -M "70 1000000 1"
> /bin/bash
> [avahi@fedora-kvm bin]$ id
> uid=70(avahi) gid=65534(nfsnobody) groups=65534(nfsnobody)
> context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> [avahi@fedora-kvm bin]$ echo "$(id -u)_not_a_sandboxed_app" >>
> shifted-fedora-tree/etc/fedora-release
> [avahi@fedora-kvm bin]$ exit
> exit
> [tixxdz@fedora-kvm bin]$ sudo ./userns-test -m -U -M "1000 1000000 1"
> /bin/bash
> [tixxdz@fedora-kvm bin]$ id
> uid=1000(tixxdz) gid=65534(nfsnobody) groups=65534(nfsnobody)
> context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> [tixxdz@fedora-kvm bin]$ echo "$(id -u)_not_a_sandboxed_app" >>
> shifted-fedora-tree/etc/fedora-release
> [tixxdz@fedora-kvm bin]$ exit
> exit
> [tixxdz@fedora-kvm bin]$ cat ~/fedora-tree/etc/fedora-release 
> Fedora release 23 (Twenty Three)
> 0_not_a_sandboxed_app
> 48_not_a_sandboxed_app
> 70_not_a_sandboxed_app
> 1000_not_a_sandboxed_app

It's good to know, but most of the shiftfs bugs are in the vfs, so you
can actually test for them without having to enter a user namespace at
all, because the uid/gid shifting occurs independently of the namespace.
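
As a rough illustration of that kind of namespace-free check, here is a
minimal sketch.  shift_id and check_uid are hypothetical helper names,
not part of shiftfs, and the sketch assumes the mount presents on-disk
id from+n as to+n for a from:to:count map like the uidmap= option quoted
above.  It only compares what stat reports on both sides of the mount.

```shell
#!/bin/bash
# Hypothetical helper: map an id through a single "from:to:count" range.
# Assumption: an id outside the range is treated as unmapped (returns 1).
shift_id() {
    local id=$1 from to count rest
    from=${2%%:*}
    rest=${2#*:}
    to=${rest%%:*}
    count=${rest#*:}
    if [ "$id" -ge "$from" ] && [ "$id" -lt $((from + count)) ]; then
        echo $((id - from + to))
    else
        return 1
    fi
}

# Hypothetical helper: compare the owner stat reports on the shifted mount
# with the owner expected from the original tree.  Uses GNU stat (-c %u).
# $1 = file in the original tree, $2 = same file under the shifted mount,
# $3 = uid map in from:to:count form.
check_uid() {
    local want got
    want=$(shift_id "$(stat -c %u "$1")" "$3") || {
        echo "unmapped uid on $1"; return 1; }
    got=$(stat -c %u "$2")
    [ "$want" = "$got" ] || {
        echo "uid mismatch on $2: want $want, got $got"; return 1; }
}
```

With the mount from the quoted example in place, something like
check_uid ~/fedora-tree/etc/fedora-release shifted-fedora-tree/etc/fedora-release 0:1000000:65536
could be run from the initial namespace.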

James

--=-qvKxEdYdco3xoTy+gOhg
Content-Type: application/x-shellscript; name="build-container"
Content-Description: 
Content-Disposition: inline; filename="build-container"
Content-Transfer-Encoding: 7bit

#!/bin/bash
set -x

arch=aarch64
rootpath=/home/jejb/containers
ctrl=/run/build-container
if [ ! -v userns ]; then
    export userns=0
fi
#if [ $(id -u) -ne 0 ]; then
#    unshare --user sleep 180 &
#    userpid=$!;
#    sudo $0 --userid $(id -u) --groupid $(id -g) --userpid $userpid "$@"
#    exit 0;
#fi
unshare --user sleep 180 &
userpid=$!

while true; do
    case $1 in
	--arch)
	    arch=$2; shift; shift ;;
	--rootpath)
	    rootpath=$2; shift; shift ;;
	--mountshift)
	    userns=1; shift ;;
	--userid)
	    uid=$2; shift; shift ;;
	--groupid)
	    gid=$2; shift; shift ;;
	--userpid)
	    userpid=$2; shift; shift ;;
	*)
	    break;;
    esac
done
if [ -z "$uid" ]; then
    uid=$(id -u)
fi
if [ -z "$gid" ]; then
    gid=$(id -g)
fi

ctroot=$rootpath/$arch
root=$ctrl/root-$arch
usernspath=$ctrl/userns

if [ "$1" == "in-ct" ]; then
    if [ $userns -eq 1 ]; then
	/home/jejb/git/bindfs/src/bindfs --map=0/10000:1/10001:2/10002:3/10003:4/10004:5/10005:6/10006:7/10007:8/10008:@0/@10000:@1/@10001 $ctroot $root
    else
	mount --bind $ctroot $root
    fi
    
    mkdir $root$ctrl
    for f in /home /var/tmp /sys /proc; do
	mount --rbind $f ${root}${f}
	mount --make-rprivate ${root}${f}
    done
    #mount --bind /usr/bin/qemu-$arch $root/qemu-$arch
    cd $root
    mkdir old-root
    pivot_root . old-root
    mount --make-rprivate /old-root
    umount -l /old-root
    rmdir /old-root
else
    if [ -e $ctrl/$arch ]; then
	echo "Error: $ctrl/$arch exists, is container running?"
	exit 1;
    fi
    if [ ! -d $ctrl ]; then
	sudo mkdir $ctrl || exit 1
    fi
    if grep -q "$ctrl tmpfs" /proc/self/mounts; then
	:
    else
	sudo mount -t tmpfs none $ctrl || exit 1
    fi

    if [ ! -e $usernspath ]; then
	touch $usernspath
	# userns are annoying.  the maps must be written all at once and
	# shell echo won't, so we trick awk into doing it
	#echo 1| awk "{print \"0 100000 1000\n${uid} ${uid} 1\n65534 65534 2\"}" > $userpid/uid_map
	newuidmap $userpid 0 100000 1000 ${uid} ${uid} 1 65534 101000 1
	if [ $gid -le 1000 ]; then
	    gidn=$((gid + 1))
	    gide=$((1000 - gidn))
	    #echo 1| awk "{print \"0 100000 ${gid}\n${gid} ${gid} 1\n${gidn} 100${gidn} ${gide}\n65533 65533 3\"}" > $userpid/gid_map
	    newgidmap $userpid 0 100000 ${gid} ${gid} ${gid} 1 ${gidn} $((100000 + gidn)) ${gide} 65533 101000 2
	else
	    #echo 1| awk "{print \"0 100000 1000\n${gid} ${gid} 1\n65533 65533 3\"}" > $userpid/gid_map
	    newgidmap $userpid 0 100000 1000 ${gid} ${gid} 1 65533 101000 2
	fi
	sudo mount --bind /proc/$userpid/ns/user $usernspath
    fi
    

    mkdir $root
    sudo chown 100000:100000 $root
    touch $ctrl/$arch
    # create the mount ns with owning user ns
    nsenter --user=$usernspath --preserve-credentials -S 0 unshare --mount sleep 10 &
    # timing problem here: need to allow nsenter time to begin executing
    sleep 1
    # can only mount on private propagation mount points
    sudo mount --make-rprivate $ctrl
    sudo mount --bind /proc/$!/ns/mnt $ctrl/$arch
    # enter the mount ns with true root, not the user ns
    nsenter --mount=$ctrl/$arch --user=$ctrl/userns --preserve-credentials -S 0 $0 --arch $arch --rootpath $rootpath in-ct
    sudo rmdir $root
fi

--=-qvKxEdYdco3xoTy+gOhg--