From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753223AbcFFWCg (ORCPT );
        Mon, 6 Jun 2016 18:02:36 -0400
Received: from bedivere.hansenpartnership.com ([66.63.167.143]:60080
        "EHLO bedivere.hansenpartnership.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1751170AbcFFWCe (ORCPT );
        Mon, 6 Jun 2016 18:02:34 -0400
Message-ID: <1465250550.2393.91.camel@HansenPartnership.com>
Subject: Re: [PATCH 0/1] shiftfs: uid/gid shifting filesystem
From: James Bottomley
To: Djalal Harouni
Cc: =?UTF-8?Q?Micha=C5=82?= Zegan, Chris Mason, tytso@mit.edu,
        Serge Hallyn, Josh Triplett, "Eric W. Biederman",
        Andy Lutomirski, Seth Forshee, linux-fsdevel@vger.kernel.org,
        linux-kernel@vger.kernel.org,
        linux-security-module@vger.kernel.org, Dongsu Park,
        David Herrmann, Miklos Szeredi, Alban Crequy, Al Viro
Date: Mon, 06 Jun 2016 15:02:30 -0700
In-Reply-To: <20160605211154.GA2901@dztty>
References: <1464740984.7732.5.camel@HansenPartnership.com>
        <2b58696f-492c-3230-0a3c-2f6f9fbff931@poczta.onet.pl>
        <1464799260.2445.10.camel@HansenPartnership.com>
        <20160605211154.GA2901@dztty>
Content-Type: multipart/mixed; boundary="=-qvKxEdYdco3xoTy+gOhg"
X-Mailer: Evolution 3.16.5
Mime-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

--=-qvKxEdYdco3xoTy+gOhg
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

On Sun, 2016-06-05 at 22:11 +0100, Djalal Harouni wrote:
> On Wed, Jun 01, 2016 at 12:41:00PM -0400, James Bottomley wrote:
> > On Wed, 2016-06-01 at 18:21 +0200, Michał Zegan wrote:
> > > As I sent a reply in a ... wrong way, I do it again. my question
> > > was:
> > > Why isn't it done at the vfs layer when you mount the fs in
> > > different
> > > userns, instead of using a separate filesystem for it?
> > >
> > Well, that is what this patch does:
> >
> > http://thread.gmane.org/gmane.linux.kernel/2214882
> >
> > However, the reason it doesn't work for me is that I want to be
> > able to
> > unpack the image into a subdirectory (so I'm not dedicating a whole
> > filesystem for this). This is primarily for a docker hack IBM is
> > working on to allow each container instance to use a separate
> > uid/gid
> > range, so I need something that behaves much more like a bind
> > mount.
> I thought that you were using a loop device ?

No, for architectural emulation containers I use file roots, so they're
subdirectories of my home directory. The interesting issues Serge
discovered are on ext4, which I needed a loop device to reproduce (my
home directory is xfs), if that's where the confusion arises.

Thinking about containers in general, a significant number use
bind-mounted file roots because that's a nice use case that hypervisors
can't match without clusterable filesystems. However, I do know some
containers that are block-image based, so whatever solution is chosen
has to support both.

> that's precisely one of the main case that's solved with that
> solution... mount the portable fs image into a loop device, set the
> shift which will be only active into that subdirectory...
>
> > > I believe it could be useful to be able to mount all filesystems
> > > in userns with autoshifted uids, although I do not know security
> > > implications for that usage.
> >
> > As long as you don't need to subdivide the volume, it works nicely.
> > However, from a security point of view, that entire volume is now
> > effectively freely writeable by anyone who can set up a userns. If
> > you follow the shiftfs route, you can break off writeable
> > subdirectories for each namespace shift, but they can't cross over
> > into writing subdirectories that belong to other user namespaces
> > (assuming the uids are fully segregated).
>
> As said in the other email, I'm not really sure about the use case at
> all... but I give you this quick test with:
> https://gist.githubusercontent.com/tixxdz/6b84c2c3bd6cb987c82255602ec70f23/raw/97c9ab76878f9d7415583c00b22ca0e4a948847b/userns_test.c
>
> $ mkdir shifted-fedora-tree && sudo mount -t shiftfs
> -ouidmap=0:1000000:65536,gidmap=0:1000000:65536 ~/fedora-tree/
> shifted-fedora-tree

This is basically what I do for my container roots. However, after that
I tend to set them up with scripts. I've attached my latest
build-container script at the bottom. As you can see from my script,
all my build containers are in /home/jejb/containers.

> [tixxdz@fedora-kvm bin]$ sudo ./userns-test -m -U -M "0 1000000 1"
> /bin/bash
> uid=0(root) gid=65534(nfsnobody) groups=65534(nfsnobody)
> context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> [root@fedora-kvm bin]# cat /proc/self/uid_map
> 0 1000000 1
> [root@fedora-kvm bin]# echo "$(id -u)_not_a_sandboxed_app" >>
> shifted-fedora-tree/etc/fedora-release
> [root@fedora-kvm bin]# exit
> exit
> [tixxdz@fedora-kvm bin]$ sudo ./userns-test -m -U -M "48 1000000 1"
> /bin/bash
> [apache@fedora-kvm bin]$ id
> uid=48(apache) gid=65534(nfsnobody) groups=65534(nfsnobody)
> context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> [apache@fedora-kvm bin]$ echo "$(id -u)_not_a_sandboxed_app" >>
> shifted-fedora-tree/etc/fedora-release
> [apache@fedora-kvm bin]$ exit
> exit
> [tixxdz@fedora-kvm bin]$ sudo ./userns-test -m -U -M "70 1000000 1"
> /bin/bash
> [avahi@fedora-kvm bin]$ id
> uid=70(avahi) gid=65534(nfsnobody) groups=65534(nfsnobody)
> context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> [avahi@fedora-kvm bin]$ echo "$(id -u)_not_a_sandboxed_app" >>
> shifted-fedora-tree/etc/fedora-release
> [avahi@fedora-kvm bin]$ exit
> exit
> [tixxdz@fedora-kvm bin]$ sudo ./userns-test -m -U -M "1000 1000000 1"
> /bin/bash
> [tixxdz@fedora-kvm bin]$ id
> uid=1000(tixxdz) gid=65534(nfsnobody) groups=65534(nfsnobody)
> context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> [tixxdz@fedora-kvm bin]$ echo "$(id -u)_not_a_sandboxed_app" >>
> shifted-fedora-tree/etc/fedora-release
> [tixxdz@fedora-kvm bin]$ exit
> exit
> [tixxdz@fedora-kvm bin]$ cat ~/fedora-tree/etc/fedora-release
> Fedora release 23 (Twenty Three)
> 0_not_a_sandboxed_app
> 48_not_a_sandboxed_app
> 70_not_a_sandboxed_app
> 1000_not_a_sandboxed_app

It's good to know, but most of the shiftfs bugs are in the vfs, so you
can actually test for them without having to enter a user namespace at
all, because the uid/gid shifting occurs independently.

James

--=-qvKxEdYdco3xoTy+gOhg
Content-Type: application/x-shellscript; name="build-container"
Content-Description:
Content-Disposition: inline; filename="build-container"
Content-Transfer-Encoding: 7bit

#!/bin/bash
set -x
arch=aarch64
rootpath=/home/jejb/containers
ctrl=/run/build-container
if [ ! -v userns ]; then
    export userns=0
fi
#if [ $(id -u) -ne 0 ]; then
#    unshare --user sleep 180 &
#    userpid=$!;
#    sudo $0 --userid $(id -u) --groupid $(id -g) --userpid $userpid "$@"
#    exit 0;
#fi
unshare --user sleep 180 &
userpid=$!
while true; do
    case $1 in
        --arch)
            arch=$2;
            shift; shift
            ;;
        --rootpath)
            rootpath=$2;
            shift; shift
            ;;
        --mountshift)
            userns=1;
            shift
            ;;
        --userid)
            uid=$2;
            shift; shift
            ;;
        --groupid)
            gid=$2;
            shift; shift
            ;;
        --userpid)
            userpid=/proc/$2;
            shift; shift
            ;;
        *)
            break;;
    esac
done
if [ -z "$uid" ]; then
    uid=$(id -u)
fi
if [ -z "$gid" ]; then
    gid=$(id -g)
fi
ctroot=$rootpath/$arch
root=$ctrl/root-$arch
usernspath=$ctrl/userns
if [ "$1" == "in-ct" ]; then
    if [ $userns -eq 1 ]; then
        /home/jejb/git/bindfs/src/bindfs --map=0/10000:1/10001:2/10002:3/10003:4/10004:5/10005:6/10006:7/10007:8/10008:@0/@10000:@1/@10001 $ctroot $root
    else
        mount --bind $ctroot $root
    fi
    mkdir $root$ctrl
    for f in /home /var/tmp /sys /proc; do
        mount --rbind $f ${root}${f}
        mount --make-rprivate ${root}${f}
    done
    #mount --bind /usr/bin/qemu-$arch $root/qemu-$arch
    cd $root
    mkdir old-root
    pivot_root . old-root
    mount --make-rprivate /old-root
    umount -l /old-root
    rmdir /old-root
else
    if [ -e $ctrl/$arch ]; then
        echo "Error: $ctrl/$arch exists, is container running?"
        exit 1;
    fi
    if [ ! -d $ctrl ]; then
        sudo mkdir $ctrl || exit 1
    fi
    if grep -q "$ctrl tmpfs" /proc/self/mounts; then
        :
    else
        sudo mount -t tmpfs none $ctrl || exit 1
    fi
    if [ ! -e $usernspath ]; then
        touch $usernspath
        # userns are annoying. the maps must be written all at once and
        # shell echo won't, so we trick awk into doing it
        #echo 1| awk "{print \"0 100000 1000\n${uid} ${uid} 1\n65534 65534 2\"}" > $userpid/uid_map
        newuidmap $userpid 0 100000 1000 ${uid} ${uid} 1 65534 101000 1
        if [ $gid -le 1000 ]; then
            gidn=$[$gid + 1]
            gide=$[1000-$gidn]
            #echo 1| awk "{print \"0 100000 ${gid}\n${gid} ${gid} 1\n${gidn} 100${gidn} ${gide}\n65533 65533 3\"}" > $userpid/gid_map
            newgidmap $userpid 0 100000 ${gid} ${gid} ${gid} 1 ${gidn} 100${gidn} ${gide} 65533 101000 2
        else
            #echo 1| awk "{print \"0 100000 1000\n${gid} ${gid} 1\n65533 65533 3\"}" > $userpid/gid_map
            newgidmap $userpid 0 100000 1000 ${gid} ${gid} 1 65533 101000 2
        fi
        sudo mount --bind /proc/$userpid/ns/user $usernspath
    fi
    mkdir $root
    sudo chown 100000.100000 $root
    touch $ctrl/$arch
    # create the mount ns with owning user ns
    nsenter --user=$usernspath --preserve-credentials -S 0 unshare --mount sleep 10 &
    # timing problem here: need to allow nsenter time to begin executing
    sleep 1
    # can only mount on private propagation mount points
    sudo mount --make-rprivate $ctrl
    sudo mount --bind /proc/$!/ns/mnt $ctrl/$arch
    # enter the mount ns with true root, not the user ns
    nsenter --mount=$ctrl/$arch --user=$ctrl/userns --preserve-credentials -S 0 $0 --arch $arch --rootpath $rootpath in-ct
    sudo rmdir $root
fi

--=-qvKxEdYdco3xoTy+gOhg--
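
The maps that the build-container script hands to newuidmap and
newgidmap are triples of inside-start, outside-start and count. As a
rough sketch of how such a map translates an ID seen inside the user
namespace to the ID outside it, the helper below walks the triples in
order. The function name map_id and the caller uid of 1000 are
assumptions made for illustration only; neither comes from the script
or from any existing tool.

```shell
#!/bin/bash
# map_id ID TRIPLE...: hypothetical helper (not part of build-container).
# Each triple is "inside-start outside-start count", the same layout as
# the arguments the script passes to newuidmap after the pid.
map_id() {
    local id=$1
    shift
    while [ $# -ge 3 ]; do
        local inside=$1 outside=$2 count=$3
        shift 3
        # An ID is covered by a triple when inside <= id < inside + count.
        if [ "$id" -ge "$inside" ] && [ "$id" -lt $((inside + count)) ]; then
            echo $((outside + id - inside))
            return 0
        fi
    done
    echo "unmapped"
    return 1
}

uid=1000    # assumed caller uid, for illustration only
map="0 100000 1000 $uid $uid 1 65534 101000 1"

map_id 0 $map            # prints 100000
map_id 999 $map          # prints 100999
map_id 1000 $map         # prints 1000 (the caller maps to itself)
map_id 65534 $map        # prints 101000
map_id 2000 $map || true # prints unmapped
```

With this map, root and the low system IDs inside the container land in
the 100000 range outside it, while the invoking user keeps the same uid
on both sides, which is why files under /home stay accessible.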