From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jann Horn Subject: Re: [PATCH RFC] Introduce new security.nscapability xattr Date: Wed, 20 Jan 2016 13:48:16 +0100 Message-ID: <20160120124816.GB32379@pc.thejh.net> References: <20151130224356.GA27972@mail.hallyn.com> <87two3w0el.fsf@x220.int.ebiederm.org> <20151204202116.GA4809@mail.hallyn.com> Mime-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="VrqPEDrXMn8OVzN4" Return-path: Content-Disposition: inline In-Reply-To: <20151204202116.GA4809@mail.hallyn.com> Sender: owner-linux-security-module@vger.kernel.org To: "Serge E. Hallyn" Cc: "Eric W. Biederman" , "Serge E. Hallyn" , lkml , Andrew Morgan , Andy Lutomirski , lxc-devel@lists.linuxcontainers.org, Richard Weinberger , LSM , linux-api@vger.kernel.org, keescook@chromium.org List-Id: linux-api@vger.kernel.org --VrqPEDrXMn8OVzN4 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Fri, Dec 04, 2015 at 02:21:16PM -0600, Serge E. Hallyn wrote: > Quoting Eric W. Biederman (ebiederm@xmission.com): > > "Serge E. Hallyn" writes: > >=20 > > > A common way for daemons to run with minimal privilege is to start as= root, > > > perhaps setuid-root, choose a desired capability set, set PR_SET_KEEP= CAPS, > > > then change uid to non-root. A simpler way to achieve this is to set= file > > > capabilities on a not-setuid-root binary. However, when installing a= package > > > inside a (user-namespaced) container, packages cannot be installed wi= th file > > > capabilities. For this reason, containers must install ping setuid-r= oot. > >=20 > > Don't ping sockets avoid that specific problem? > >=20 > > I expect the general case still holds. > >=20 > > > To achieve this, we would need for containers to be able to request f= ile > > > capabilities be added to a file without causing these to be honored i= n the > > > initial user namespace. > > > > > > To this end, the patch below introduces a new capability xattr format= =2E The > > > main enhancement over the existing security.capability xattr is that = we > > > tag capability sets with a uid - the uid of the root user in the name= space > > > where the capabilities are set. The capabilities will be ignored in = any > > > other namespace. The special case of uid =3D=3D -1 (which must only = ever be > > > able to be set by kuid 0) means use the capabilities in all > > > namespaces. >=20 > really since security.capability xattrs are currently honored in all > namespaces this isn't really necessary. Until and unless Seth's set > changes that. >=20 > >=20 > > A quick comment on this. > >=20 > > We currently allow capabilities that have been gained to be valid in all > > descendent user namespaces. > >=20 > > Applying this principle to the on-disk capabilities would make it so > > that uid 0 would mean capabilities in all namespaces. > >=20 > > It might be worth it to introduce a fixed sized array with a length > > parameter of perhaps 32 entries which is a path of root uids as seen by > > the initial user namespace. That way the entire construction of the > > user namespace could be verified. AKA verify the current user namespace > > and the parent and the parents parent. Up to the user namespace the > > current filesystem is mounted in. We would look at how much space > > allows an xattr to be stored without causing filesystems a challenge > > to properly size such an array. > >=20 > > Given that uids are fundamentally flat that might not be particularly > > useful. If we add an alternative way of identifying user namespaces > > say a privileged operation that set a uuid, then the complete path would > > be more interesting. > >=20 > > > An alternative format would use a pair of uids to indicate a range of= rootids. > > > This would allow root in a user namespace with uids 100000-165536 map= ped to > > > set the xattr once on a file, then launch nested containers wherein t= he file > > > could be used with privilege. That's not what this patch does, but w= ould be > > > a trivial change if people think it would be worthwhile. > > > > > > This patch does not actually address the real problem, which is setti= ng the > > > xattrs from inside containers. For that, I think the best solution i= s to > > > add a pair of new system calls, setfcap and getfcap. Userspace would = for > > > instance call fsetfcap(fd, cap_user_header_t, cap_user_data_t), to wh= ich > > > the kernel would, if not in init_user_ns, react by writing an appropr= iate > > > security.nscapability xattr. > >=20 > > That feels hard to maintain, but you may be correct that we have a small > > enough userspace that it would not be a problem. > >=20 > > Eric > >=20 > >=20 > > > The libcap2 library's cap_set_file/cap_get_file could be switched over > > > transparently to use this to hide its use from all callers. > > > > > > Comments appreciated. > > > > > > Note - In this patch, file capabilities only work for containers whic= h have > > > a root uid defined. We may want to allow -1 uids to work in all > > > namespaces. There certainly would be uses for this, but I'm a bit un= settled > > > about the implications of allowing a program privilege in a container= where > > > there is no uid with privilege. This needs more thought. >=20 > So for actually enabling (user-namespaced) containers to use these, there > are a few possibilities that come to mine. >=20 > 1. A new setfcap (/getfcap) syscall. Uses mapped uid 0 from > current_user_ns() to write a value in the security.nscapability xattr. > Userspace doesn't need to worry at all about namespace issues. Your patch has an "ncaps" field and supports giving (possibly different) privileges to different namespace root users. I think that with a setfcap() syscall as you describe, it would be hard to add more than one security.nscapability entry to a file without poking holes into stuff: If setfcap() requires inode write access, every namespace root for whom a security.nscapability entry can be added can overwrite the file, so namespace roots might be able to execute code as each other. If setfcap() doesn't require inode write access, that isn't a big security issue, but it allows unprivileged users to waste disk space in a writable-mounted filesystem on which they have no write access anywhere. I'm not sure whether that is a sufficiently big problem for anyone to care. Another issue with this would be that it makes restoring sub-namespace capabilities from a backup somewhat ugly, at least if you're not in the init ns: You'd have to parse the capabilities attribute, then, for every attribute you want to restore whose rootid doesn't refer to your namespace, clone(CLONE_NEWUSER), map the target uid to 0 from the parent process, seteuid(0), do the setfcap() and exit. It would require less synchronization with my ptrace hardening patch that checks whether uids are mapped (if that lands, I don't think it has yet?) because that would allow you to first setuid, then clone(), map the uid to 0 inside the namespace and setfcap(), but it'd still require special-case code in backup software. > 2. Just expect userspace to write a xattr; kernel checks that no values > are changed for any other namespaces. This could be a lot of parsing and > verifying in the kernel. This option would allow the first problem I described with option 1 (if it is a problem?) to be worked around by letting a helper in the outer namespace add capabilities for lower namespaces - but it wouldn't be pretty. I think it would also allow tar, which already supports xattrs, to deal with namespace capabilities without needing any additional code. I think that would be nice to have. > 3. Switch the xattr scheme - instead of one security.nscapability xattr > with multiple entries, use security.nscapability.$(rootid). Now the > kernel only needs to verify that the $rootid is valid for the writing > task, and we don't need a new syscall. OTOH userspace needs to know > what it's doing. Of course we can still hide that behind libcap2's helpe= rs. This doesn't sound so good - how would the namespace know its rootid? If it is just one level below init_ns, it can grab it from /proc/self/uid_map, but for deeper levels that won't work. > Any opinions on which way seems best? 1 does seem cleanest (and supports > use of seccomp if we want to forbit its use by some containers), but > involves a new pair of syscalls. 2 seems to me to be right out, but > others might disagree... --VrqPEDrXMn8OVzN4 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: Digital signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAEBAgAGBQJWn4IQAAoJED4KNFJOeCOo3qkQAMY/7p5b+7p0QNtinEiy9Pch DbzmaOXbdgXjcum6BV7uSUy3zHEuO3OZu6I12JRRPOmsbqUgALZ9vBnWAbT48nqH wbqN8Ia8Rf7fpnqdRhY8S3hm2LYikHeSlXWE4C5vFAJgN1eDXycL4S6XKTZK54gW EIIDrfsUiDUpLLaQQCXpA+TUKqfeVIHcAlOwbDhZBJwHd6O3HeFakxyrbo8OM2Gi 8B4YhPK9joblWtXfPgFTVmYIWn4SfkKNy+HI4DClr3byQrRQzeNBsIfmIZTv2vfr YRVQEpK4MoSrnz2+WhPjedyttZsuqsvYPqb06YJaISSLttJU5DiXOacpu/b7ugb6 rDcKDtr218uZDhxTyVqPyj+xS4OP69WGCwlKHxtwWAOzIFCQLG5vZhuzpGkcWP06 hDBSB8hHil3IuUOa+DZG9AGnVVYYooik/S+hIxb22j92YOiBvkafHUnynOgwGJ32 emQd+2HH0rElPRqSQdx0mbnzM8pPBQTvjEceTH/c4R1nSBOsjn1/FrXDG+ZMLFd6 Rx9tyYsJyhqc9uxmVjTyhJMnxLwbMA1Xh9tY1QAJ6Uac8P17TGA4d1JXOuxkuRWh PyHodKaUgVlt8Rc2wJEH9IBhmkx4yw76JHv2o4cBEpOhEsGjvFtpy140ETSykLei 93ZCDYP5sxg9wqGEycTN =raYd -----END PGP SIGNATURE----- --VrqPEDrXMn8OVzN4--