From: ebiederm@xmission.com (Eric W. Biederman)
To: James Bottomley
Cc: Djalal Harouni, Alexander Viro, Chris Mason, tytso@mit.edu,
    Serge Hallyn, Josh Triplett, Andy Lutomirski, Seth Forshee,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-security-module@vger.kernel.org, Dongsu Park, David Herrmann,
    Miklos Szeredi, Alban Crequy, Dave Chinner
Subject: Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
Date: Tue, 17 May 2016 17:40:12 -0500
Message-ID: <87lh38xq9f.fsf@x220.int.ebiederm.org>
In-Reply-To: <1463425996.4101.14.camel@HansenPartnership.com>
    (James Bottomley's message of "Mon, 16 May 2016 15:13:16 -0400")
References: <1462395979.14310.133.camel@HansenPartnership.com>
    <20160505073636.GA3357@dztty>
    <1462449388.2419.27.camel@HansenPartnership.com>
    <20160505214957.GA3071@dztty>
    <1462486085.2289.23.camel@HansenPartnership.com>
    <1462923416.14896.10.camel@HansenPartnership.com>
    <20160511164247.GA9908@dztty.fritz.box>
    <1462991618.2356.55.camel@HansenPartnership.com>
    <20160512195552.GB2859@dztty>
    <1463091852.2380.72.camel@HansenPartnership.com>
    <20160514095303.GA3476@dztty>
    <1463233614.2355.20.camel@HansenPartnership.com>
    <87twi0giws.fsf@x220.int.ebiederm.org>
    <1463425996.4101.14.camel@HansenPartnership.com>
X-Mailing-List: linux-kernel@vger.kernel.org

James Bottomley writes:

> On Sat, 2016-05-14 at 21:21 -0500, Eric W. Biederman wrote:
>> James Bottomley writes:
>>
>> > On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
>>
>> Just a couple of quick comments from a very high level design point.
>>
>> - I think a shiftfs is valuable in the same way that overlayfs is
>>   valuable.
>>
>>   Especially in the Docker case where a lot of containers want a shared
>>   base image (for efficiency), but it is desirable to run those
>>   containers in different user namespaces for safety.
>>
>> - It is also the plan to make it possible to mount a filesystem where
>>   the uids and gids of that filesystem on disk do not have a one to one
>>   mapping to kernel uids and gids.  99% of the work has already been
>>   done, for all filesystems except XFS.
>
> Can you elaborate a bit more on why we want to do this?  I think only
> having a single shift of uid_t to kuid_t across the kernel to user
> boundary is a nice feature of user namespaces.  Architecturally, it's
> not such a big thing to do it as the data goes on to the disk as well,
> but what's the use case for it?

fuse/nfs or just plain sanity.

As the data comes off disk we convert it into the kernel internal form,
kuid_t and kgid_t.  For shiftfs this would mean converting the uids when
they come from your underlying filesystem to the upper level vfs
abstractions.

Converting to the kernel form for a filesystem such as fuse, which does
all that is necessary to keep evil users from breaking the kernel, means
that we can allow users in a user namespace to mount fuse themselves and
supply whatever uids and gids they want in the fuse messages.

If the uids/gids don't map from the mounting user's user namespace into
the kernel then we set inode->i_uid to INVALID_UID.  That is all we ask
of a filesystem, and we are sorting out the rest in the VFS, as nothing
sets INVALID_UID in inode->i_uid today.

>> That said there are some significant issues to work through, before
>> something like that can be enabled.
>>
>> * Handling of uids/gids on disk that don't map into a kuid/kgid.
>
> So I think this is nicely handled in the capability checks in
> generic_permission() (capable_wrt_inode_uidgid()) is there a need to
> make it more complex (and thus more error prone)?

No, just a need to handle INVALID_UID and INVALID_GID, which we don't
handle today.

>> * Safety from poisoned filesystem images.
>
> By poisoned FS image, you mean an image over whose internal data the
> user has control?
> The basic problem of how do we give users write
> access to data devices they can then cause to be mounted as
> filesystems?

Yes.  For fuse, except for uids and gids, this is already solved.  For
most other filesystems it is a whole new world of horror.  The general
case of evil usb devices (think android) that look like block devices but
can return whatever they want already exists in the wild.

>> I have slowly been working with Seth Forshee on these issues as
>> the last thing I want is to introduce more security bugs right now.
>> Seth being a braver man than I am has already merged his changes into
>> the Ubuntu kernel.
>>
>> Right now we are targeting fuse, because fuse is already designed to
>> handle poisoned filesystem images.  So to safely enable this kind of
>> mapping for fuse is not a giant step.
>>
>> The big thing from my point of view is to get the VFS interfaces
>> correct so that the VFS handles all of the weird cases that come up
>> with uids and gids that don't map, and any other weird cases.  Keeping
>> the weird bits out of the filesystems.
>
> If by VFS interfaces, you mean where we've already got the mapping
> confined, absolutely.

Yes.  It is just making certain we handle the INVALID_UID and INVALID_GID
that result from a mapping failure, as we don't handle that in 4.6.0.

>> James I think you are missing the fact that all filesystems already
>> have the make_kuid and make_kgid calls right where the data comes off
>> disk,
>
> I beg to differ: they certainly don't.  The underlying filesystem
> populates the inode in ->lookup with the data off the disk which goes
> into the inode as a kuid_t/kgid_t  It remains forever in the inode as
> that.  We convert it as it goes out of the kernel in the stat calls
> (actually stat.c:cp_old/new_stat())

They do.  i_uid_write calls make_kuid to map the incoming uid from disk
into a kuid_t.  That is all I was referring to.
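
For readers following along, here is a rough userspace sketch of the
"uid off disk -> kuid_t" step described above.  It is not the kernel
source: every model_* name is made up for illustration, and the extent
layout (first, lower_first, count) is assumed to mirror the three columns
of a uid_map entry.  The point it shows is that an id with no mapping in
the chosen namespace becomes the invalid id, which is the INVALID_UID case
the VFS then has to cope with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only; none of the model_* names are kernel symbols. */
#define MODEL_INVALID_ID ((uint32_t)-1)   /* stands in for INVALID_UID */

struct model_extent {
	uint32_t first;        /* first id as seen inside the namespace */
	uint32_t lower_first;  /* first kernel-internal id it maps to */
	uint32_t count;        /* number of consecutive ids in the range */
};

struct model_userns {
	size_t nr_extents;
	struct model_extent extent[5];
};

struct model_inode {
	uint32_t i_uid;        /* kernel-internal id, playing the role of kuid_t */
};

/* Identity mapping, playing the role of &init_user_ns. */
static const struct model_userns model_init_ns = {
	.nr_extents = 1,
	.extent = { { .first = 0, .lower_first = 0, .count = UINT32_MAX } },
};

/* Hypothetical container namespace: ids 0..65535 map to 100000..165535. */
static const struct model_userns model_container_ns = {
	.nr_extents = 1,
	.extent = { { .first = 0, .lower_first = 100000, .count = 65536 } },
};

/* Models make_kuid(): namespace id -> kernel id, or the invalid id. */
static uint32_t model_make_kuid(const struct model_userns *ns, uint32_t uid)
{
	assert(ns != NULL);
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct model_extent *e = &ns->extent[i];

		if (uid >= e->first && uid - e->first < e->count)
			return e->lower_first + (uid - e->first);
	}
	return MODEL_INVALID_ID;   /* no mapping: the case the VFS must handle */
}

/* Models an i_uid_write() whose namespace is a parameter, not fixed. */
static void model_i_uid_write(struct model_inode *inode,
			      const struct model_userns *ns, uint32_t disk_uid)
{
	inode->i_uid = model_make_kuid(ns, disk_uid);
}
```

With model_init_ns every on-disk id except (uint32_t)-1 maps to itself;
with model_container_ns an on-disk uid of 70000 has no mapping and the
inode ends up holding the invalid id.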

The only thing I am looking at infrastructure-wise is to make it so that
we cleanly handle the case where the first parameter to make_kuid is not
&init_user_ns.  That is the core point of Seth's work.

>> and the from_kuid and from_kgid calls right where the on-disk data
>> is being created just before it goes on disk.  Which means that the
>> actual impact on filesystems of the translation is trivial.
>
> Are you looking at a different tree from me?  I'm actually just looking
> at Linus git head.

Take a look at i_uid_read and i_gid_read.  They are inline functions in
fs.h.

Eric
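
As an illustration of the opposite direction (kuid_t -> on-disk uid), the
following is a minimal userspace sketch under the same caveats: the model_*
names are hypothetical, the extent layout is assumed to follow the uid_map
columns, and nothing here is copied from fs.h.  It models what a
from_kuid-style lookup does just before an id is written back to disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only; none of the model_* names are kernel symbols. */
#define MODEL_INVALID_ID ((uint32_t)-1)   /* "no mapping" result */

struct model_extent {
	uint32_t first;        /* first id as seen inside the namespace */
	uint32_t lower_first;  /* first kernel-internal id it maps to */
	uint32_t count;        /* number of consecutive ids in the range */
};

struct model_userns {
	size_t nr_extents;
	struct model_extent extent[5];
};

struct model_inode {
	uint32_t i_uid;        /* kernel-internal id, playing the role of kuid_t */
};

/* Hypothetical container namespace: ids 0..65535 map to 100000..165535. */
static const struct model_userns model_container_ns = {
	.nr_extents = 1,
	.extent = { { .first = 0, .lower_first = 100000, .count = 65536 } },
};

/* Models from_kuid(): kernel id -> namespace id, or the invalid id. */
static uint32_t model_from_kuid(const struct model_userns *ns, uint32_t kuid)
{
	assert(ns != NULL);
	for (size_t i = 0; i < ns->nr_extents; i++) {
		const struct model_extent *e = &ns->extent[i];

		if (kuid >= e->lower_first && kuid - e->lower_first < e->count)
			return e->first + (kuid - e->lower_first);
	}
	return MODEL_INVALID_ID;   /* kernel id not representable in this ns */
}

/* Models an i_uid_read() whose namespace is a parameter, not fixed. */
static uint32_t model_i_uid_read(const struct model_inode *inode,
				 const struct model_userns *ns)
{
	return model_from_kuid(ns, inode->i_uid);
}
```

A kernel-internal id of 100005 reads back as 5 through
model_container_ns, while an id outside 100000..165535 (say 1000) has no
on-disk representation in that namespace and yields the invalid id.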