Linux userland API discussions
 help / color / mirror / Atom feed
* Re: kdbus: add documentation
From: Florian Weimer @ 2014-11-30  8:56 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: arnd, ebiederm, gnomes, teg, jkosina, luto, linux-api,
	linux-kernel, daniel, dh.herrmann, tixxdz
In-Reply-To: <1416546149-24799-2-git-send-email-gregkh@linuxfoundation.org>

* Greg Kroah-Hartman:

> +The focus of this document is an overview of the low-level, native kernel D-Bus
> +transport called kdbus. Kdbus exposes its functionality via files in a
> +filesystem called 'kdbusfs'. All communication between processes takes place
> +via ioctls on files exposed through the mount point of a kdbusfs. The default
> +mount point of kdbusfs is /sys/fs/kdbus.

Does this mean the bus does not enforce the correctness of the D-Bus
introspection metadata?  That's really unfortunate.  Classic D-Bus
does not do this, either, and combined with the variety of approaches
used to implement D-Bus endpoints, it makes it really difficult to
figure out what D-Bus services, exactly, a process provides.

^ permalink raw reply

* Re: [RFC PATCH] proc, pidns: Add highpid
From: Florian Weimer @ 2014-11-30  8:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Herrmann, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Andrew Morton, systemd Mailing List, criu-GEFAQzZX7r8dnm+yROfE0A,
	Pavel Emelyanov, Cyrill Gorcunov
In-Reply-To: <b0f6c4df0e8ef8afcc7b786edecb4be8c752941e.1417215468.git.luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>

* Andy Lutomirski:

> The initial implementation is straightforward: highpid is simply a
> 64-bit counter. If a high-end system can fork every 3 ns (which
> would be amazing, given that just allocating a pid requires at
> atomic operation), it would take well over 1000 years for highpid to
> wrap.

I'm not sure if I'm reading the patch correctly, but is the counter
namespaced?  If yes, why?

^ permalink raw reply

* Re: [PATCH net-next 3/6] samples: bpf: example of stateful socket filtering
From: Alexei Starovoitov @ 2014-11-30  6:24 UTC (permalink / raw)
  To: David Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, Linux API,
	Network Development, LKML
In-Reply-To: <20141129.210158.2021042941461629799.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

On Sat, Nov 29, 2014 at 9:01 PM, David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> wrote:
> From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
> Date: Wed, 26 Nov 2014 21:42:28 -0800
>
>> this socket filter example does:
>> - creates arraymap in kernel with key 4 bytes and value 8 bytes
>>
>> - loads eBPF program:
>>   r0 = skb[14 + 9]; // load one byte of ip->proto
>  ...
>> +             BPF_LD_ABS(BPF_B, 14 + 9 /* R0 = ip->proto */),
>
> I do not want anything having to do with fixed offsets from
> the skb.
>
> Nothing should know where things are in the SKB structure,
> especially user facing things.
>
> That's why we have explicit BPF operations for fetching
> specific SKB members, so that the layout is completely
> transparent to the entity generating BPF programs.
>
> Besides retaining the flexibility of changing the SKB
> layout arbitrarily without breaking bpf programs, there
> are also security considerations from allowing bpf
> programs to load arbitrary offsets.
>
> Sorry, I do not like this patch series at all.

sigh.
You misunderstood one wrong comment in the whole series.
r0 = skb[14 + 9]; // load one byte of ip->proto
is saying
r0 = skb->data[14 + 9]; // load one byte of ip->proto
it's loading one byte of packet payload and
not one byte of skb structure.

There are plenty of other comments where I mentioned that.
Including cover letter:
"Though native eBPF programs are way more powerful than classic filters
(attachable through similar setsockopt() call), they don't have skb field
accessors yet. Like skb->pkt_type, skb->dev->ifindex are not accessible.
There are sevaral ways to achieve that. That will be in the next set of patches.
So in this set native eBPF programs can only read data from packet and
access maps"

and in patch #2:
+static bool sock_filter_is_valid_access(int off, int size, enum
bpf_access_type type)
+{
+       /* skb fields cannot be accessed yet */
+       return false;
+}

In other words, there is no way to access skb fields yet.
There are several different ways I've implemented to
access skb fields, but they all had their cons and pros, so I
dropped them all to focus on basic things first.
Which is: read packet data and access maps.

Please let me know if you want me to fix this comment
in this patch of sample code.

^ permalink raw reply

* Re: [PATCH net-next 3/6] samples: bpf: example of stateful socket filtering
From: David Miller @ 2014-11-30  5:01 UTC (permalink / raw)
  To: ast-uqk4Ao+rVK5Wk0Htik3J/w
  Cc: mingo-DgEjT+Ai2ygdnm+yROfE0A, luto-kltTT9wpgjJwATOyAt5JVQ,
	dborkman-H+wXaHxf7aLQT0dZR+AlfA,
	hannes-tFNcAqjVMyqKXQKiL6tip0B+6BGkLq7r,
	edumazet-hpIqsD4AKlfQT0dZR+AlfA, linux-api-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1417066951-1999-4-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>

From: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
Date: Wed, 26 Nov 2014 21:42:28 -0800

> this socket filter example does:
> - creates arraymap in kernel with key 4 bytes and value 8 bytes
> 
> - loads eBPF program:
>   r0 = skb[14 + 9]; // load one byte of ip->proto
 ...
> +		BPF_LD_ABS(BPF_B, 14 + 9 /* R0 = ip->proto */),

I do not want anything having to do with fixed offsets from
the skb.

Nothing should know where things are in the SKB structure,
especially user facing things.

That's why we have explicit BPF operations for fetching
specific SKB members, so that the layout is completely
transparent to the entity generating BPF programs.

Besides retaining the flexibility of changing the SKB
layout arbitrarily without breaking bpf programs, there
are also security considerations from allowing bpf
programs to load arbitrary offsets.

Sorry, I do not like this patch series at all.

^ permalink raw reply

* Re: Why not make kdbus use CUSE?
From: Greg Kroah-Hartman @ 2014-11-29 17:59 UTC (permalink / raw)
  To: Richard Yao
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20141129063416.GE32286-WkJ4KB730YtOk+SH+krGketnjj8NnB7F@public.gmane.org>

On Sat, Nov 29, 2014 at 06:34:16AM +0000, Richard Yao wrote:
> I had the opportunity at LinuxCon Europe to chat with Greg and some other kdbus
> developers. A few things stood out from our conversation that I thought I would
> bring to the list for discussion.

Any reason why you didn't respond to the kdbus patches themselves?
Critiquing the specific code is much better than random discussions.

> They regard a userland compatibility shim in the systemd repostory to provide
> backward compatibility for applications. Unfortunately, this is insufficient to
> ensure compatibility because dependency trees have multiple levels. If cross
> platform package A depends on cross platform library B, which depends on dbus,
> and cross platform library B decides to switch to kdbus, then it ceases to be
> cross platform and cross platform package A is now dependent on Linux kernels
> with kdbus. Not only does that affect other POSIX systems, but it also affects
> LTS versions of Linux.

What does LTS versions have anything to do here?  And what specific
dependancies are you worried about?

> It is somewhat tempting to think that being in the kernel is necessary for
> performance, this does not appear to be true from my discussion with Greg and
> others. In specific, a key advantage of being in the kernel is a reduction in
> context switches and consequently, one would expect programs using the old API
> to benefit, but they were quite clear to me that programs using the old API do
> not benefit. At the same time, we had a similar situation where people thought
> that the httpd server had to be inside the kernel until Linux 2.6, when our
> userland APIs improved to the point where we were able to get similar if not
> better performance in userland compared to the implementation of khttpd in Linux
> 2.4.y.

Again, please see the kernel patches for lots of detail as to why this
should be in the kernel.  If you disagree with the specific statements I
have listed there, please respond with specifics.

> I started to think that we probably ought to design a way to put kdbus into
> userland and then I realized that we already have one in the form of CUSE. This
> would not only makes kdbus play nicely with SELinux and lxc, but also other
> POSIX systems that currently share dbus with Linux systems, which includes older
> Linux kernels. Greg claimed that the kdbus code was fairly self contained and
> was just a character device, so I assume this is possible and I am curious why
> it is not done.

The latest version is a filesystem not a character device, your
information is out of date :)

> P.S. I also mentioned my concern that having the shim in the systemd repository
> would have a negative effect on distributons that use alterntaive libc libraries
> because the systemd developers refuse to support alternative libc libraries. I
> mentioned this to one of the people to whom Greg introduced me (and whose name
> escapes me) as we were walking to Michael Kerrisk's talk on API design. I was
> told quite plainly that such distributions are not worth consideration. If kdbus
> is merged despite concerns about security and backward compatibility, could we
> at least have the shim moved to libc netural place, like Linus' tree?

Take that up on the systemd mailing list, it's not a kernel issue.

greg k-h

^ permalink raw reply

* [PATCH v2] userns: Disallow setgroups unless the gid_map writer is privileged
From: Andy Lutomirski @ 2014-11-29 17:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Josh Triplett, Andrew Morton, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel@vger.kernel.org, LSM, Casey Schaufler,
	Serge E. Hallyn, Richard Weinberger, kenton, Andy Lutomirski,
	stable

Classic unix permission checks have an interesting feature.  The
group permissions for a file can be set to less than the other
permissions on a file.  Occasionally this is used deliberately to
give a certain group of users fewer permissions than the default.

User namespaces break this usage.  Groups set in rgid or egid are
unaffected because an unprivileged user namespace creator can only
map a single group, so setresgid inside and outside the namespace
have the same effect.  However, an unprivileged user namespace
creator can currently use setgroups(2) to drop all supplementary
groups, so, if a supplementary group denies access to some resource,
user namespaces can be used to bypass that restriction.

To fix this issue, this introduces a new user namespace flag
USERNS_SETGROUPS_ALLOWED.  If that flag is not set, then
setgroups(2) will fail regardless of the caller's capabilities.

USERNS_SETGROUPS_ALLOWED is cleared in a new user namespace.  By
default, if the writer of gid_map has CAP_SETGID in the parent
userns and the parent userns has USERNS_SETGROUPS_ALLOWED, then the
USERNS_SETGROUPS_ALLOWED will be set in the child.  If the writer is
not so privileged, then writing to gid_map will fail unless the
writer adds "setgroups deny" to gid_map, in which case the check is
skipped but USERNS_SETGROUPS_ALLOWED will remain cleared.

The full semantics are:

If "setgroups allow" is present or no explicit "setgroups" setting
is written to gid_map, then writing to gid_map will fail with -EPERM
unless the opener and writer have CAP_SETGID in the parent namespace
and the parent namespace has USERNS_SETGROUPS_ALLOWED.

If "setgroups deny" is present, then writing gid_map will work as
before, but USERNS_SETGROUPS_ALLOWED will remain cleared.  This will
result in processes in the userns that have CAP_SETGID to be
nontheless unable to use setgroups(2).  If this breaks something
inside the userns, then this is okay -- the userns creator
specifically requested this behavior.

While it could be safe to set USERNS_SETGROUPS_ALLOWED if the user
namespace creator has no supplementary groups, doing so could be
surprising and could have unpleasant interactions with setns(2).

Any application that uses newgidmap(1) should be unaffected by this
fix, but unprivileged users that create user namespaces to
manipulate mounts or sandbox themselves will break until they start
using "setgroups deny".

This should fix CVE-2014-8989.

Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---

Unlike v1, this *will* break things like Sandstorm.  Fixing them will be
easy.  I agree that this will result in better long-term semantics, but
I'm not so happy about breaking working software.

If this is unpalatable, here's a different option: get rid of all these
permission checks and just change setgroups.  Specifically, make it so
that setgroups(2) in a userns will succeed but will silently refuse to
remove unmapped groups.

Changes from v1:
 - Userns flags are now properly atomic.
 - "setgroups allow" is now the default, so legacy unprivileged gid_map
   writers will start to fail.

 include/linux/user_namespace.h |  3 +++
 kernel/groups.c                |  3 +++
 kernel/user.c                  |  1 +
 kernel/user_namespace.c        | 42 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 49 insertions(+)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index e95372654f09..0ae4a8c97165 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -17,6 +17,8 @@ struct uid_gid_map {	/* 64 bytes -- 1 cache line */
 	} extent[UID_GID_MAP_MAX_EXTENTS];
 };
 
+#define USERNS_SETGROUPS_ALLOWED 0
+
 struct user_namespace {
 	struct uid_gid_map	uid_map;
 	struct uid_gid_map	gid_map;
@@ -27,6 +29,7 @@ struct user_namespace {
 	kuid_t			owner;
 	kgid_t			group;
 	unsigned int		proc_inum;
+	unsigned long		flags;
 
 	/* Register of per-UID persistent keyrings for this namespace */
 #ifdef CONFIG_PERSISTENT_KEYRINGS
diff --git a/kernel/groups.c b/kernel/groups.c
index 451698f86cfa..b5ec42423202 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -6,6 +6,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/user_namespace.h>
 #include <asm/uaccess.h>
 
 /* init to 2 - one for init_task, one to ensure it is never freed */
@@ -223,6 +224,8 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
 	struct group_info *group_info;
 	int retval;
 
+	if (!test_bit(USERNS_SETGROUPS_ALLOWED, &current_user_ns()->flags))
+		return -EPERM;
 	if (!ns_capable(current_user_ns(), CAP_SETGID))
 		return -EPERM;
 	if ((unsigned)gidsetsize > NGROUPS_MAX)
diff --git a/kernel/user.c b/kernel/user.c
index 4efa39350e44..58fba8ea0845 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -51,6 +51,7 @@ struct user_namespace init_user_ns = {
 	.owner = GLOBAL_ROOT_UID,
 	.group = GLOBAL_ROOT_GID,
 	.proc_inum = PROC_USER_INIT_INO,
+	.flags = (1 << USERNS_SETGROUPS_ALLOWED),
 #ifdef CONFIG_PERSISTENT_KEYRINGS
 	.persistent_keyring_register_sem =
 	__RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem),
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index aa312b0dc3ec..1f63935483e9 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -601,6 +601,10 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	char *kbuf, *pos, *next_line;
 	ssize_t ret = -EINVAL;
 
+	bool may_setgroups = false;
+	bool setgroups_requested = true;
+	bool seen_explicit_setgroups = false;
+
 	/*
 	 * The id_map_mutex serializes all writes to any given map.
 	 *
@@ -633,6 +637,18 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
 		goto out;
 
+	if (map == &ns->gid_map) {
+		/*
+		 * Setgroups is permitted if the writer and the
+		 * parent ns are privileged.
+		 */
+		may_setgroups =
+			test_bit(USERNS_SETGROUPS_ALLOWED,
+				 &ns->parent->flags) &&
+			file_ns_capable(file, ns->parent, CAP_SETGID) &&
+			ns_capable(ns->parent, CAP_SETGID);
+	}
+
 	/* Get a buffer */
 	ret = -ENOMEM;
 	page = __get_free_page(GFP_TEMPORARY);
@@ -667,6 +683,23 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 				next_line = NULL;
 		}
 
+		/* Is this line a gid_map option? */
+		if (map == &ns->gid_map) {
+			if (!strcmp(pos, "setgroups deny")) {
+				if (seen_explicit_setgroups)
+					goto out;
+				seen_explicit_setgroups = true;
+				setgroups_requested = false;
+				continue;
+			} else if (!strcmp(pos, "setgroups allow")) {
+				if (seen_explicit_setgroups)
+					goto out;
+				seen_explicit_setgroups = true;
+				setgroups_requested = true;
+				continue;
+			}
+		}
+
 		pos = skip_spaces(pos);
 		extent->first = simple_strtoul(pos, &pos, 10);
 		if (!isspace(*pos))
@@ -741,6 +774,15 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 		extent->lower_first = lower_first;
 	}
 
+	/* Validate and install setgroups permission. */
+	if (map == &ns->gid_map && setgroups_requested) {
+		if (!may_setgroups) {
+			ret = -EPERM;
+			goto out;
+		}
+		set_bit(USERNS_SETGROUPS_ALLOWED, &ns->flags);
+	}
+
 	/* Install the map */
 	memcpy(map->extent, new_map.extent,
 		new_map.nr_extents*sizeof(new_map.extent[0]));
-- 
1.9.3


^ permalink raw reply related

* Re: [PATCH v6 37/46] tun: move internal flag defines out of uapi
From: Michael S. Tsirkin @ 2014-11-29 17:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, David Miller,
	cornelia.huck-tA70FqPdS9bQT0dZR+AlfA,
	rusty-8fk3Idey6ehBDgjK7y7TUQ, nab-IzHhD5pYlfBP7FQvKIMDCQ,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA,
	thuth-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dahi-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Zhi Yong Wu, Ben Hutchings,
	Herbert Xu, Tom Herbert, Masatake YAMATO, Xi Wang,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1417162052.5822.0-LLcwxoN42L4NLKR9yMNcA1aTQe2KTcn/@public.gmane.org>

On Fri, Nov 28, 2014 at 08:15:32AM +0008, Jason Wang wrote:
> 
> 
> On Fri, Nov 28, 2014 at 4:10 AM, Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >TUN_ flags are internal and never exposed
> >to userspace. Any application using it is almost
> >certainly buggy.
> >
> >Move them out to tun.c, we'll remove them in follow-up patches.
> >
> >Signed-off-by: Michael S. Tsirkin <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >---
> > include/uapi/linux/if_tun.h | 16 ++--------
> > drivers/net/tun.c           | 74
> >++++++++++++++-------------------------------
> > 2 files changed, 26 insertions(+), 64 deletions(-)
> >
> >diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
> >index e9502dd..277a260 100644
> >--- a/include/uapi/linux/if_tun.h
> >+++ b/include/uapi/linux/if_tun.h
> >@@ -22,21 +22,11 @@
> > /* Read queue size */
> > #define TUN_READQ_SIZE	500
> >-
> >-/* TUN device flags */
> >-#define TUN_TUN_DEV 	0x0001	
> >-#define TUN_TAP_DEV	0x0002
> >+/* TUN device type flags: deprecated. Use IFF_TUN/IFF_TAP instead. */
> >+#define TUN_TUN_DEV 	IFF_TUN
> >+#define TUN_TAP_DEV	IFF_TAP
> > #define TUN_TYPE_MASK   0x000f
> >-#define TUN_FASYNC	0x0010
> >-#define TUN_NOCHECKSUM	0x0020
> >-#define TUN_NO_PI	0x0040
> >-/* This flag has no real effect */
> >-#define TUN_ONE_QUEUE	0x0080
> >-#define TUN_PERSIST 	0x0100	
> >-#define TUN_VNET_HDR 	0x0200
> >-#define TUN_TAP_MQ      0x0400
> >-
> > /* Ioctl defines */
> > #define TUNSETNOCSUM  _IOW('T', 200, int)  #define TUNSETDEBUG
> >_IOW('T', 201, int) diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >index 9dd3746..bc89d07 100644
> >--- a/drivers/net/tun.c
> >+++ b/drivers/net/tun.c
> >@@ -103,6 +103,21 @@ do {								\
> > } while (0)
> > #endif
> >+/* TUN device flags */
> >+
> >+/* IFF_ATTACH_QUEUE is never stored in device flags,
> >+ * overload it to mean fasync when stored there.
> >+ */
> >+#define TUN_FASYNC	IFF_ATTACH_QUEUE
> >+#define TUN_NO_PI	IFF_NO_PI
> >+/* This flag has no real effect */
> >+#define TUN_ONE_QUEUE	IFF_ONE_QUEUE
> >+#define TUN_PERSIST 	IFF_PERSIST
> >+#define TUN_VNET_HDR 	IFF_VNET_HDR
> >+#define TUN_TAP_MQ      IFF_MULTI_QUEUE
> >+
> >+#define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
> >+		      IFF_MULTI_QUEUE)
> > #define GOODCOPY_LEN 128
> > #define FLT_EXACT_COUNT 8
> >@@ -1521,32 +1536,7 @@ static struct proto tun_proto = {
> > static int tun_flags(struct tun_struct *tun)
> > {
> >-	int flags = 0;
> >-
> >-	if (tun->flags & TUN_TUN_DEV)
> >-		flags |= IFF_TUN;
> >-	else
> >-		flags |= IFF_TAP;
> >-
> >-	if (tun->flags & TUN_NO_PI)
> >-		flags |= IFF_NO_PI;
> >-
> >-	/* This flag has no real effect.  We track the value for backwards
> >-	 * compatibility.
> >-	 */
> >-	if (tun->flags & TUN_ONE_QUEUE)
> >-		flags |= IFF_ONE_QUEUE;
> >-
> >-	if (tun->flags & TUN_VNET_HDR)
> >-		flags |= IFF_VNET_HDR;
> >-
> >-	if (tun->flags & TUN_TAP_MQ)
> >-		flags |= IFF_MULTI_QUEUE;
> >-
> >-	if (tun->flags & TUN_PERSIST)
> >-		flags |= IFF_PERSIST;
> >-
> >-	return flags;
> >+	return tun->flags & (TUN_FEATURES | IFF_PERSIST | IFF_TUN | IFF_TAP);
> > }
> >   static ssize_t tun_show_flags(struct device *dev, struct
> >device_attribute *attr,
> >@@ -1706,28 +1696,8 @@ static int tun_set_iff(struct net *net, struct file
> >*file, struct ifreq *ifr)
> > 	tun_debug(KERN_INFO, tun, "tun_set_iff\n");
> >-	if (ifr->ifr_flags & IFF_NO_PI)
> >-		tun->flags |= TUN_NO_PI;
> >-	else
> >-		tun->flags &= ~TUN_NO_PI;
> >-
> >-	/* This flag has no real effect.  We track the value for backwards
> >-	 * compatibility.
> >-	 */
> >-	if (ifr->ifr_flags & IFF_ONE_QUEUE)
> >-		tun->flags |= TUN_ONE_QUEUE;
> >-	else
> >-		tun->flags &= ~TUN_ONE_QUEUE;
> >-
> >-	if (ifr->ifr_flags & IFF_VNET_HDR)
> >-		tun->flags |= TUN_VNET_HDR;
> >-	else
> >-		tun->flags &= ~TUN_VNET_HDR;
> >-
> >-	if (ifr->ifr_flags & IFF_MULTI_QUEUE)
> >-		tun->flags |= TUN_TAP_MQ;
> >-	else
> >-		tun->flags &= ~TUN_TAP_MQ;
> >+	tun->flags = (tun->flags & ~TUN_FEATURES) |
> >+		(ifr->ifr_flags & TUN_FEATURES);
> > 	/* Make sure persistent devices do not get stuck in
> > 	 * xoff state.
> >@@ -1890,9 +1860,11 @@ static long __tun_chr_ioctl(struct file *file,
> >unsigned int cmd,
> > 	if (cmd == TUNGETFEATURES) {
> > 		/* Currently this just means: "what IFF flags are valid?".
> > 		 * This is needed because we never checked for invalid flags on
> >-		 * TUNSETIFF. */
> >-		return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
> >-				IFF_VNET_HDR | IFF_MULTI_QUEUE,
> >+		 * TUNSETIFF.  Why do we report IFF_TUN and IFF_TAP which are
> >+		 * not legal for TUNSETIFF here?  It's probably a bug, but it
> >+		 * doesn't seem to be worth fixing.
> >+		 */
> 
> The question is not very clear. IFF_TUN and IFF_TAP are legal for
> ifr.ifr_flags during TUNSETIFF. No?


Absolutely. I'm not sure what I meant anymore. I will drop this chunk -
either by replacing this patch if I have to post another version, or
by a separate patch if I don't.

> >
> >+		return put_user(IFF_TUN | IFF_TAP | TUN_FEATURES,
> > 				(unsigned int __user*)argp);
> > 	} else if (cmd == TUNSETQUEUE)
> > 		return tun_set_queue(file, &ifr);
> >-- 
> >MST
> >
> >--
> >To unsubscribe from this list: send the line "unsubscribe netdev" in
> >the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH] userns: Disallow setgroups unless the gid_map writer is privileged
From: Andy Lutomirski @ 2014-11-29 16:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Josh Triplett, Andrew Morton, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, LSM,
	Casey Schaufler, Serge E. Hallyn, Richard Weinberger,
	Kenton Varda, stable
In-Reply-To: <87vblym17s.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

On Sat, Nov 29, 2014 at 8:16 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>
> The patch is buggy.
>
> Nacked-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
>
>> ---
>>
>> Eric, this is an alternative to your patch.  I think it will cause
>> less breakage, and it will keep unprivileged user namespaces
>> more or less fully functional.
>>
>> Kenton, I think that neither run-bundle nor supervisor-main will be
>> broken by this patch.
>>
>>  include/linux/user_namespace.h |  3 +++
>>  kernel/groups.c                |  3 +++
>>  kernel/user.c                  |  1 +
>>  kernel/user_namespace.c        | 36 ++++++++++++++++++++++++++++++++++++
>>  4 files changed, 43 insertions(+)
>>
>> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
>> index e95372654f09..a74c1f3d44fe 100644
>> --- a/include/linux/user_namespace.h
>> +++ b/include/linux/user_namespace.h
>> @@ -17,6 +17,8 @@ struct uid_gid_map {        /* 64 bytes -- 1 cache line */
>>       } extent[UID_GID_MAP_MAX_EXTENTS];
>>  };
>>
>> +#define USERNS_SETGROUPS_ALLOWED 1
>> +
>>  struct user_namespace {
>>       struct uid_gid_map      uid_map;
>>       struct uid_gid_map      gid_map;
>> @@ -27,6 +29,7 @@ struct user_namespace {
>>       kuid_t                  owner;
>>       kgid_t                  group;
>>       unsigned int            proc_inum;
>> +     unsigned int            flags;
>
> If you are going to add a flags field it needs to be atomic as otherwise
> changing or reading individual flags won't be safe without a lock.

Will fix.

>
>>       /* Register of per-UID persistent keyrings for this namespace */
>>  #ifdef CONFIG_PERSISTENT_KEYRINGS
>> diff --git a/kernel/groups.c b/kernel/groups.c
>> index 451698f86cfa..e27433809978 100644
>> --- a/kernel/groups.c
>> +++ b/kernel/groups.c
>> @@ -6,6 +6,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/security.h>
>>  #include <linux/syscalls.h>
>> +#include <linux/user_namespace.h>
>>  #include <asm/uaccess.h>
>>
>>  /* init to 2 - one for init_task, one to ensure it is never freed */
>> @@ -223,6 +224,8 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
>>       struct group_info *group_info;
>>       int retval;
>>
>> +     if (!(current_user_ns()->flags & USERNS_SETGROUPS_ALLOWED))
>> +             return -EPERM;
>>       if (!ns_capable(current_user_ns(), CAP_SETGID))
>>               return -EPERM;
>>       if ((unsigned)gidsetsize > NGROUPS_MAX)
>> diff --git a/kernel/user.c b/kernel/user.c
>> index 4efa39350e44..f8cdb1ec6049 100644
>> --- a/kernel/user.c
>> +++ b/kernel/user.c
>> @@ -51,6 +51,7 @@ struct user_namespace init_user_ns = {
>>       .owner = GLOBAL_ROOT_UID,
>>       .group = GLOBAL_ROOT_GID,
>>       .proc_inum = PROC_USER_INIT_INO,
>> +     .flags = USERNS_SETGROUPS_ALLOWED,
>>  #ifdef CONFIG_PERSISTENT_KEYRINGS
>>       .persistent_keyring_register_sem =
>>       __RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem),
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index aa312b0dc3ec..6e7b9ee5bddc 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -600,6 +600,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>>       unsigned long page = 0;
>>       char *kbuf, *pos, *next_line;
>>       ssize_t ret = -EINVAL;
>> +     unsigned int gid_flags = 0;
>> +     bool seen_explicit_gid_flag = false;
>>
>>       /*
>>        * The id_map_mutex serializes all writes to any given map.
>> @@ -633,6 +635,19 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>>       if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
>>               goto out;
>>
>> +     /* Deal with supplementary groups. */
>> +     if (map == &ns->gid_map) {
>> +             /*
>> +              * By default, setgroups is allowed inside the userns
>> +              * if the writer has no supplementary groups (making
>> +              * it useless) or if the writer is privileged.
>> +              */
>> +             if ((ns->parent->flags & USERNS_SETGROUPS_ALLOWED) &&
>> +                 file_ns_capable(file, ns->parent, CAP_SETGID) &&
>> +                 ns_capable(ns->parent, CAP_SETGID))
>> +                     gid_flags = USERNS_SETGROUPS_ALLOWED;
>
> We can't do this.
>
> It is wrong to mix permissions and flags to request functionality.
>
> That way lies madness, and impossible maintenance, and it will silently
> break every application that expects setgroups to work if they have
> CAP_SETGID after a mapping has been established.

We can make writing gid_map fail unless you either have permissions or
add the "setgroups disallow" option.  Would that be better?  It avoids
silent creation of a strangely behaving userns.  This would be an easy
adjustment to this patch.

I don't like the "allow setgroups if there are no supplementary
groups" because it's weird and unexpected (and will therefore break
things, I expect) and because it may interact insecurely with setns
and other such things.

--Andy

>
>> +     }
>> +
>>       /* Get a buffer */
>>       ret = -ENOMEM;
>>       page = __get_free_page(GFP_TEMPORARY);
>> @@ -667,6 +682,25 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>>                               next_line = NULL;
>>               }
>>
>> +             /* Is this line a gid_map option? */
>> +             if (map == &ns->gid_map) {
>> +                     if (!strcmp(pos, "setgroups deny")) {
>> +                             if (seen_explicit_gid_flag)
>> +                                     goto out;
>> +                             seen_explicit_gid_flag = 1;
>> +                             gid_flags = 0;
>> +                             continue;
>> +                     } else if (!strcmp(pos, "setgroups allow")) {
>> +                             if (seen_explicit_gid_flag)
>> +                                     goto out;
>> +                             if (!(gid_flags & USERNS_SETGROUPS_ALLOWED)) {
>> +                                     ret = -EPERM;
>> +                                     goto out;
>> +                             }
>> +                             continue;
>> +                     }
>> +             }
>> +
>>               pos = skip_spaces(pos);
>>               extent->first = simple_strtoul(pos, &pos, 10);
>>               if (!isspace(*pos))
>> @@ -746,6 +780,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>>               new_map.nr_extents*sizeof(new_map.extent[0]));
>>       smp_wmb();
>>       map->nr_extents = new_map.nr_extents;
>> +     if (map == &ns->gid_map)
>> +             ns->flags |= gid_flags;
>>
>>       *ppos = count;
>>       ret = count;
>
> Eric



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH] userns: Disallow setgroups unless the gid_map writer is privileged
From: Eric W. Biederman @ 2014-11-29 16:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux Containers, Josh Triplett, Andrew Morton, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel@vger.kernel.org, LSM, Casey Schaufler,
	Serge E. Hallyn, Richard Weinberger,
	kenton-AuYgBwuPrUQTaNkGU808tA, stable-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <81ba656ffbdc62a7f7e94ad17f0238ee063fe6b3.1417214430.git.luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

The patch is buggy.

Nacked-by: "Eric W. Biederman" <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>

> ---
>
> Eric, this is an alternative to your patch.  I think it will cause
> less breakage, and it will keep unprivileged user namespaces
> more or less fully functional.
>
> Kenton, I think that neither run-bundle nor supervisor-main will be
> broken by this patch.
>
>  include/linux/user_namespace.h |  3 +++
>  kernel/groups.c                |  3 +++
>  kernel/user.c                  |  1 +
>  kernel/user_namespace.c        | 36 ++++++++++++++++++++++++++++++++++++
>  4 files changed, 43 insertions(+)
>
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index e95372654f09..a74c1f3d44fe 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -17,6 +17,8 @@ struct uid_gid_map {	/* 64 bytes -- 1 cache line */
>  	} extent[UID_GID_MAP_MAX_EXTENTS];
>  };
>  
> +#define USERNS_SETGROUPS_ALLOWED 1
> +
>  struct user_namespace {
>  	struct uid_gid_map	uid_map;
>  	struct uid_gid_map	gid_map;
> @@ -27,6 +29,7 @@ struct user_namespace {
>  	kuid_t			owner;
>  	kgid_t			group;
>  	unsigned int		proc_inum;
> +	unsigned int		flags;

If you are going to add a flags field it needs to be atomic as otherwise
changing or reading individual flags won't be safe without a lock.

>  	/* Register of per-UID persistent keyrings for this namespace */
>  #ifdef CONFIG_PERSISTENT_KEYRINGS
> diff --git a/kernel/groups.c b/kernel/groups.c
> index 451698f86cfa..e27433809978 100644
> --- a/kernel/groups.c
> +++ b/kernel/groups.c
> @@ -6,6 +6,7 @@
>  #include <linux/slab.h>
>  #include <linux/security.h>
>  #include <linux/syscalls.h>
> +#include <linux/user_namespace.h>
>  #include <asm/uaccess.h>
>  
>  /* init to 2 - one for init_task, one to ensure it is never freed */
> @@ -223,6 +224,8 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
>  	struct group_info *group_info;
>  	int retval;
>  
> +	if (!(current_user_ns()->flags & USERNS_SETGROUPS_ALLOWED))
> +		return -EPERM;
>  	if (!ns_capable(current_user_ns(), CAP_SETGID))
>  		return -EPERM;
>  	if ((unsigned)gidsetsize > NGROUPS_MAX)
> diff --git a/kernel/user.c b/kernel/user.c
> index 4efa39350e44..f8cdb1ec6049 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -51,6 +51,7 @@ struct user_namespace init_user_ns = {
>  	.owner = GLOBAL_ROOT_UID,
>  	.group = GLOBAL_ROOT_GID,
>  	.proc_inum = PROC_USER_INIT_INO,
> +	.flags = USERNS_SETGROUPS_ALLOWED,
>  #ifdef CONFIG_PERSISTENT_KEYRINGS
>  	.persistent_keyring_register_sem =
>  	__RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem),
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index aa312b0dc3ec..6e7b9ee5bddc 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -600,6 +600,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>  	unsigned long page = 0;
>  	char *kbuf, *pos, *next_line;
>  	ssize_t ret = -EINVAL;
> +	unsigned int gid_flags = 0;
> +	bool seen_explicit_gid_flag = false;
>  
>  	/*
>  	 * The id_map_mutex serializes all writes to any given map.
> @@ -633,6 +635,19 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>  	if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
>  		goto out;
>  
> +	/* Deal with supplementary groups. */
> +	if (map == &ns->gid_map) {
> +		/*
> +		 * By default, setgroups is allowed inside the userns
> +		 * if the writer has no supplementary groups (making
> +		 * it useless) or if the writer is privileged.
> +		 */
> +		if ((ns->parent->flags & USERNS_SETGROUPS_ALLOWED) &&
> +		    file_ns_capable(file, ns->parent, CAP_SETGID) &&
> +		    ns_capable(ns->parent, CAP_SETGID))
> +			gid_flags = USERNS_SETGROUPS_ALLOWED;

We can't do this.

It is wrong to mix permissions and flags to request functionality.

That way lies madness, and impossible maintenance, and it will silently
break every application that expects setgroups to work if they have
CAP_SETGID after a mapping has been established.

> +	}
> +
>  	/* Get a buffer */
>  	ret = -ENOMEM;
>  	page = __get_free_page(GFP_TEMPORARY);
> @@ -667,6 +682,25 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>  				next_line = NULL;
>  		}
>  
> +		/* Is this line a gid_map option? */
> +		if (map == &ns->gid_map) {
> +			if (!strcmp(pos, "setgroups deny")) {
> +				if (seen_explicit_gid_flag)
> +					goto out;
> +				seen_explicit_gid_flag = 1;
> +				gid_flags = 0;
> +				continue;
> +			} else if (!strcmp(pos, "setgroups allow")) {
> +				if (seen_explicit_gid_flag)
> +					goto out;
> +				if (!(gid_flags & USERNS_SETGROUPS_ALLOWED)) {
> +					ret = -EPERM;
> +					goto out;
> +				}
> +				continue;
> +			}
> +		}
> +
>  		pos = skip_spaces(pos);
>  		extent->first = simple_strtoul(pos, &pos, 10);
>  		if (!isspace(*pos))
> @@ -746,6 +780,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
>  		new_map.nr_extents*sizeof(new_map.extent[0]));
>  	smp_wmb();
>  	map->nr_extents = new_map.nr_extents;
> +	if (map == &ns->gid_map)
> +		ns->flags |= gid_flags;
>  
>  	*ppos = count;
>  	ret = count;

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC PATCH] proc, pidns: Add highpid
From: Cyrill Gorcunov @ 2014-11-29 16:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Herrmann, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Andrew Morton, systemd Mailing List
In-Reply-To: <b0f6c4df0e8ef8afcc7b786edecb4be8c752941e.1417215468.git.luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>

On Fri, Nov 28, 2014 at 03:05:01PM -0800, Andy Lutomirski wrote:
> Pid reuse is common, which means that it's difficult or impossible
> to read information about a pid from /proc without races.
> 
> This introduces a second number associated with each (task, pidns)
> pair called highpid.  Highpid is a 64-bit number, and, barring
> extremely unlikely circumstances or outright error, a (highpid, pid)
> will never be reused.
> 
> With just this change, a program can open /proc/PID/status, read the
> "Highpid" field, and confirm that it has the expected value.  If the
> pid has been reused, then highpid will be different.
> 
> The initial implementation is straightforward: highpid is simply a
> 64-bit counter. If a high-end system can fork every 3 ns (which
> would be amazing, given that just allocating a pid requires at
> atomic operation), it would take well over 1000 years for highpid to
> wrap.
> 
> For CRIU's benefit, the next highpid can be set by a privileged
> user.
> 
> NB: The sysctl stuff only works on 64-bit systems.  If the approach
> looks good, I'll fix that somehow.
> 
> Signed-off-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
> ---
> 
> If this goes in, there's plenty of room to add new interfaces to
> make this more useful.  For example, we could add a fancier tgkill
> that adds and validates hightgid and highpid, and we might want to
> add a syscall to read one's own hightgid and highpid.  These would
> be quite useful for pidfiles.
> 
> David, would this be useful for kdbus?
> 
> CRIU people: will this be unduly difficult to support in CRIU?

Hi Andy. I think it won't be hard to support.

^ permalink raw reply

* Re: [RFC PATCH] userns: Disallow setgroups unless the gid_map writer is privileged
From: serge-A9i7LUbDfNHQT0dZR+AlfA @ 2014-11-29 15:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-man, Kees Cook, Linux API, Linux Containers, Josh Triplett,
	stable-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	kenton-AuYgBwuPrUQTaNkGU808tA, LSM, Michael Kerrisk-manpages,
	Richard Weinberger, Casey Schaufler, Andrew Morton,
	Andy Lutomirski
In-Reply-To: <81ba656ffbdc62a7f7e94ad17f0238ee063fe6b3.1417214430.git.luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>

iiuc this should be ok for lxc since it always has a privileged map writer. 

(sorry I'm pretty much afk until dec 10)

Thanks,
- sergeOn 11/28/14 16:53 Andy Lutomirski wrote:
Classic unix permission checks have an interesting feature.  The
group permissions for a file can be set to less than the other
permissions on a file.  Occasionally this is used deliberately to
give a certain group of users fewer permissions than the default.

User namespaces break this usage.  Groups set in rgid or egid are
unaffected because an unprivileged user namespace creator can only
map a single group, so setresgid inside and outside the namespace
have the same effect.  However, an unprivileged user namespace
creator can currently use setgroups(2) to drop all supplementary
groups, so, if a supplementary group denies access to some resource,
user namespaces can be used to bypass that restriction.

To fix this issue without unduly breaking existing user code, this
introduces a new user namespace flag USERNS_SETGROUPS_ALLOWED.  If
that flag is not set, then setgroups(2) will fail regardless of the
caller's capabilities.

USERNS_SETGROUPS_ALLOWED is cleared in a new user namespace.  By
default, if the writer of gid_map has CAP_SETGID in the parent
userns and the parent userns has USERNS_SETGROUPS_ALLOWED, then the
USERNS_SETGROUPS_ALLOWED will be set in the child.

While it could be safe to set USERNS_SETGROUPS_ALLOWED if the user
namespace creator has no supplementary groups, doing so could be
surprising and could have unpleasant interactions with setns(2).

The default behavior can be overridden by adding a line of the form
"setgroups allow" or "setgroups deny" to gid_map.  The former will
return -EPERM if the caller is not permitted to set
USERNS_SETGROUPS_ALLOWED, and the latter will clear
USERNS_SETGROUPS_ALLOWED even if the flag would otherwise be set.

This change should break userspace by the minimal amount needed to
fix this issue.  Any application that uses newgidmap(1) should be
unaffected by this fix, and unprivileged users that create user
namespaces to manipulate mounts or sandbox themselves will also be
unaffected unless they try to use setgroups(2).

This should fix CVE-2014-8989.

Cc: stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
---

Eric, this is an alternative to your patch.  I think it will cause
less breakage, and it will keep unprivileged user namespaces
more or less fully functional.

Kenton, I think that neither run-bundle nor supervisor-main will be
broken by this patch.

 include/linux/user_namespace.h |  3 +++
 kernel/groups.c                |  3 +++
 kernel/user.c                  |  1 +
 kernel/user_namespace.c        | 36 ++++++++++++++++++++++++++++++++++++
 4 files changed, 43 insertions(+)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index e95372654f09..a74c1f3d44fe 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -17,6 +17,8 @@ struct uid_gid_map {	/* 64 bytes -- 1 cache line */
 	} extent[UID_GID_MAP_MAX_EXTENTS];
 };
 
+#define USERNS_SETGROUPS_ALLOWED 1
+
 struct user_namespace {
 	struct uid_gid_map	uid_map;
 	struct uid_gid_map	gid_map;
@@ -27,6 +29,7 @@ struct user_namespace {
 	kuid_t			owner;
 	kgid_t			group;
 	unsigned int		proc_inum;
+	unsigned int		flags;
 
 	/* Register of per-UID persistent keyrings for this namespace */
 #ifdef CONFIG_PERSISTENT_KEYRINGS
diff --git a/kernel/groups.c b/kernel/groups.c
index 451698f86cfa..e27433809978 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -6,6 +6,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/user_namespace.h>
 #include <asm/uaccess.h>
 
 /* init to 2 - one for init_task, one to ensure it is never freed */
@@ -223,6 +224,8 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
 	struct group_info *group_info;
 	int retval;
 
+	if (!(current_user_ns()->flags & USERNS_SETGROUPS_ALLOWED))
+		return -EPERM;
 	if (!ns_capable(current_user_ns(), CAP_SETGID))
 		return -EPERM;
 	if ((unsigned)gidsetsize > NGROUPS_MAX)
diff --git a/kernel/user.c b/kernel/user.c
index 4efa39350e44..f8cdb1ec6049 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -51,6 +51,7 @@ struct user_namespace init_user_ns = {
 	.owner = GLOBAL_ROOT_UID,
 	.group = GLOBAL_ROOT_GID,
 	.proc_inum = PROC_USER_INIT_INO,
+	.flags = USERNS_SETGROUPS_ALLOWED,
 #ifdef CONFIG_PERSISTENT_KEYRINGS
 	.persistent_keyring_register_sem =
 	__RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem),
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index aa312b0dc3ec..6e7b9ee5bddc 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -600,6 +600,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	unsigned long page = 0;
 	char *kbuf, *pos, *next_line;
 	ssize_t ret = -EINVAL;
+	unsigned int gid_flags = 0;
+	bool seen_explicit_gid_flag = false;
 
 	/*
 	 * The id_map_mutex serializes all writes to any given map.
@@ -633,6 +635,19 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
 		goto out;
 
+	/* Deal with supplementary groups. */
+	if (map == &ns->gid_map) {
+		/*
+		 * By default, setgroups is allowed inside the userns
+		 * if the writer has no supplementary groups (making
+		 * it useless) or if the writer is privileged.
+		 */
+		if ((ns->parent->flags & USERNS_SETGROUPS_ALLOWED) &&
+		    file_ns_capable(file, ns->parent, CAP_SETGID) &&
+		    ns_capable(ns->parent, CAP_SETGID))
+			gid_flags = USERNS_SETGROUPS_ALLOWED;
+	}
+
 	/* Get a buffer */
 	ret = -ENOMEM;
 	page = __get_free_page(GFP_TEMPORARY);
@@ -667,6 +682,25 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 				next_line = NULL;
 		}
 
+		/* Is this line a gid_map option? */
+		if (map == &ns->gid_map) {
+			if (!strcmp(pos, "setgroups deny")) {
+				if (seen_explicit_gid_flag)
+					goto out;
+				seen_explicit_gid_flag = 1;
+				gid_flags = 0;
+				continue;
+			} else if (!strcmp(pos, "setgroups allow")) {
+				if (seen_explicit_gid_flag)
+					goto out;
+				if (!(gid_flags & USERNS_SETGROUPS_ALLOWED)) {
+					ret = -EPERM;
+					goto out;
+				}
+				continue;
+			}
+		}
+
 		pos = skip_spaces(pos);
 		extent->first = simple_strtoul(pos, &pos, 10);
 		if (!isspace(*pos))
@@ -746,6 +780,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 		new_map.nr_extents*sizeof(new_map.extent[0]));
 	smp_wmb();
 	map->nr_extents = new_map.nr_extents;
+	if (map == &ns->gid_map)
+		ns->flags |= gid_flags;
 
 	*ppos = count;
 	ret = count;
-- 
1.9.3

^ permalink raw reply related

* Re: [RFC PATCH] proc, pidns: Add highpid
From: Andy Lutomirski @ 2014-11-29 15:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: systemd Mailing List, Pavel Emelyanov, Linux API,
	linux-kernel@vger.kernel.org, criu, Cyrill Gorcunov,
	Andrew Morton
In-Reply-To: <87mw7aof2m.fsf@x220.int.ebiederm.org>

On Fri, Nov 28, 2014 at 7:34 PM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> Andy Lutomirski <luto@amacapital.net> writes:
>
>> Pid reuse is common, which means that it's difficult or impossible
>> to read information about a pid from /proc without races.
>
> Sigh.
>
> What we need are not race free pids, but a file descriptor based process
> management api.  Possibly one that starts by handing you a proc
> directory.

I agree completely, and the Capsicum stuff from FreeBSD would probably
be a very good start here.

>
> Which probably means that we need a proc file we can write to and send
> signals to a process, and another proc file we can select on and wait
> for the process to exit.

That too.  In fact, I have an old patch that went nowhere that makes
the proc directory fd itself pollable.

>
> Making pids bigger just looks like bandaid.
>
> Remember evovle things in the direction of an object capability system
> things wind up being more maintainable.

Yes, but this really is intended to be a much better bandaid than what
people currently use.

I'm aiming this at two main use cases that aren't going to switch to
using fds any time soon.  One is shell stuff and PID files.  If we can
put something like "12345@81726162" in /run/lock/foo.pid and safely
kill `cat /run/lock/foo.pid`, that would be nice (even though sensible
users don't use pid files any more, there are *tons* of things that
still use them).  This could also be used for diagnostics.  Suppose
the kernel log indicates a misbehaving pid.  There's currently no
race-free way to poke at those pids.

The much more pressing issue is that there are apparently programs
that think that checking the process's starttime is a good idea for
avoiding PID reuse races.  Both kdbus v1 and v2 do this, hopefully
only for diagnostics.  This is, IMO, completely unacceptable, but I
really don't expect kdbus to start passing even more fds around when
they'll be ignored most of the time.  So, if kdbus is going to be
merged at some point, I'd like to give it the opportunity to name the
sender of a message in a way that is only mildly dangerous (in non
race-related ways) as opposed to being totally broken.

Now if someone wants to implement real light-weight capability-ish fds
in Linux, that would be neat.  IIRC someone tried several years ago
using fds with some high bits set, but it never went anywhere.

FWIW, given that I seem to be the most vocal reviewer of the kdbus
patches, I feel like it's nice to offer some assistance in a piece of
the kdbus stuff that I think really would benefit from a new kernel
feature.

--Andy
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel

^ permalink raw reply

* Re: [RFC PATCH] proc, pidns: Add highpid
From: Andy Lutomirski @ 2014-11-29 15:19 UTC (permalink / raw)
  To: Greg KH
  Cc: systemd Mailing List, Linux API, linux-kernel@vger.kernel.org,
	Eric W. Biederman, Andrew Morton
In-Reply-To: <20141129042308.GB30450@kroah.com>

On Nov 28, 2014 9:24 PM, "Greg KH" <greg@kroah.com> wrote:
>
> On Fri, Nov 28, 2014 at 03:05:01PM -0800, Andy Lutomirski wrote:
> > Pid reuse is common, which means that it's difficult or impossible
> > to read information about a pid from /proc without races.
> >
> > This introduces a second number associated with each (task, pidns)
> > pair called highpid.  Highpid is a 64-bit number, and, barring
> > extremely unlikely circumstances or outright error, a (highpid, pid)
> > will never be reused.
> >
> > With just this change, a program can open /proc/PID/status, read the
> > "Highpid" field, and confirm that it has the expected value.  If the
> > pid has been reused, then highpid will be different.
> >
> > The initial implementation is straightforward: highpid is simply a
> > 64-bit counter. If a high-end system can fork every 3 ns (which
> > would be amazing, given that just allocating a pid requires at
> > atomic operation), it would take well over 1000 years for highpid to
> > wrap.
> >
> > For CRIU's benefit, the next highpid can be set by a privileged
> > user.
> >
> > NB: The sysctl stuff only works on 64-bit systems.  If the approach
> > looks good, I'll fix that somehow.
> >
> > Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> > ---
> >
> > If this goes in, there's plenty of room to add new interfaces to
> > make this more useful.  For example, we could add a fancier tgkill
> > that adds and validates hightgid and highpid, and we might want to
> > add a syscall to read one's own hightgid and highpid.  These would
> > be quite useful for pidfiles.
> >
> > David, would this be useful for kdbus?
> >
> > CRIU people: will this be unduly difficult to support in CRIU?
> >
> > If you all think this is a good idea, I'll fix the sysctl stuff and
> > document it.  It might also be worth adding "Hightgid" to status.
> >
> >  fs/proc/array.c               |  5 ++++-
> >  include/linux/pid.h           |  2 ++
> >  include/linux/pid_namespace.h |  1 +
> >  kernel/pid.c                  | 19 +++++++++++++++----
> >  kernel/pid_namespace.c        | 22 ++++++++++++++++++++++
> >  5 files changed, 44 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/proc/array.c b/fs/proc/array.c
> > index cd3653e4f35c..f1e0e69d19f9 100644
> > --- a/fs/proc/array.c
> > +++ b/fs/proc/array.c
> > @@ -159,6 +159,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> >       int g;
> >       struct fdtable *fdt = NULL;
> >       const struct cred *cred;
> > +     const struct upid *upid;
> >       pid_t ppid, tpid;
> >
> >       rcu_read_lock();
> > @@ -170,12 +171,14 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> >               if (tracer)
> >                       tpid = task_pid_nr_ns(tracer, ns);
> >       }
> > +     upid = pid_upid_ns(pid, ns);
> >       cred = get_task_cred(p);
> >       seq_printf(m,
> >               "State:\t%s\n"
> >               "Tgid:\t%d\n"
> >               "Ngid:\t%d\n"
> >               "Pid:\t%d\n"
> > +             "Highpid:\t%llu\n"
> >               "PPid:\t%d\n"
> >               "TracerPid:\t%d\n"
> >               "Uid:\t%d\t%d\t%d\t%d\n"
>
> Changing existing proc files like this is dangerous and can cause lots
> of breakage in userspace programs if you are not careful.  Usually
> adding fields to the end of the file is best, but sometimes even then
> things can break :(

Point taken.  I had hoped that everything would parse /proc/PID/status
by looking for the lamed headers.  I'll see what the history of new
fields in there is, but I can change this easily enough.

--Andy

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel

^ permalink raw reply

* Why not make kdbus use CUSE?
From: Richard Yao @ 2014-11-29  6:34 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Greg Kroah-Hartman, linux-api-u79uwXL29TY76Z2rM5mHXA

I had the opportunity at LinuxCon Europe to chat with Greg and some other kdbus
developers. A few things stood out from our conversation that I thought I would
bring to the list for discussion.

The first is that I asked them why we need to add yet another IPC mechanism (and
quite possibly another exploit target) to the kernel. Apparently, they want to
use dbus to do multicast and the existing dbus software is not peformant enough.
There was not much discussion of why the existing network stack is not usable
for this, but I was not terribly concerned about it, so the remainder of our
discussion focused on compatibility.

They regard a userland compatibility shim in the systemd repostory to provide
backward compatibility for applications. Unfortunately, this is insufficient to
ensure compatibility because dependency trees have multiple levels. If cross
platform package A depends on cross platform library B, which depends on dbus,
and cross platform library B decides to switch to kdbus, then it ceases to be
cross platform and cross platform package A is now dependent on Linux kernels
with kdbus. Not only does that affect other POSIX systems, but it also affects
LTS versions of Linux.

It is somewhat tempting to think that being in the kernel is necessary for
performance, this does not appear to be true from my discussion with Greg and
others. In specific, a key advantage of being in the kernel is a reduction in
context switches and consequently, one would expect programs using the old API
to benefit, but they were quite clear to me that programs using the old API do
not benefit. At the same time, we had a similar situation where people thought
that the httpd server had to be inside the kernel until Linux 2.6, when our
userland APIs improved to the point where we were able to get similar if not
better performance in userland compared to the implementation of khttpd in Linux
2.4.y.

Putting daemons in the kernel is always more performant than putting daemons
into userland, but it has the drawback of violating the principle of least
privilege. When code is in userland, we can apply security mechanisms to it via
things like SELinux and seccomp to limit the damage caused by compromise.  With
an in-kernel component, there is no way of doing that. One might be tempted to
think that controlling the IPC mechanism is as good as controlling the system,
but this is not true when we consider things like lxc, where compromise of dbus
in a container does not give full control over the system.

I started to think that we probably ought to design a way to put kdbus into
userland and then I realized that we already have one in the form of CUSE. This
would not only makes kdbus play nicely with SELinux and lxc, but also other
POSIX systems that currently share dbus with Linux systems, which includes older
Linux kernels. Greg claimed that the kdbus code was fairly self contained and
was just a character device, so I assume this is possible and I am curious why
it is not done.

I should probably mention one other thing that I recall from my discussion with
Greg and others, which is that the systemd project wants to depend on it. The
nature of controlling pid 1 means that systemd is more than capable of starting
dbus before anything that needs it and that includes its own components (aside
from its pid 1). The systemd project wanting the API is not a valid reason for
why it should be in the kernel, although it could be a reason to make a CUSE
version go into systemd's pid 1.

That said, why not make kdbus use CUSE?

P.S. I also mentioned my concern that having the shim in the systemd repository
would have a negative effect on distributons that use alterntaive libc libraries
because the systemd developers refuse to support alternative libc libraries. I
mentioned this to one of the people to whom Greg introduced me (and whose name
escapes me) as we were walking to Michael Kerrisk's talk on API design. I was
told quite plainly that such distributions are not worth consideration. If kdbus
is merged despite concerns about security and backward compatibility, could we
at least have the shim moved to libc netural place, like Linus' tree?

^ permalink raw reply

* Charity Work
From: luv2charitys-Re5JQEeQqe8AvxtiuMwx3w @ 2014-11-29  4:54 UTC (permalink / raw)
  To: linux-api-u79uwXL29TY76Z2rM5mHXA

Hello,this is Mr Paul N,i sent you an email on charity work but i am yet to hear fom you,do reply with this code CHA-2015 to my email address paulcharity-9uewiaClKEY@public.gmane.org  i Look forward to hearing from you this time,God bless  Brother Paul

^ permalink raw reply

* Re: [RFC PATCH] proc, pidns: Add highpid
From: Greg KH @ 2014-11-29  4:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Herrmann, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Eric W. Biederman,
	Andrew Morton, systemd Mailing List
In-Reply-To: <b0f6c4df0e8ef8afcc7b786edecb4be8c752941e.1417215468.git.luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>

On Fri, Nov 28, 2014 at 03:05:01PM -0800, Andy Lutomirski wrote:
> Pid reuse is common, which means that it's difficult or impossible
> to read information about a pid from /proc without races.
> 
> This introduces a second number associated with each (task, pidns)
> pair called highpid.  Highpid is a 64-bit number, and, barring
> extremely unlikely circumstances or outright error, a (highpid, pid)
> will never be reused.
> 
> With just this change, a program can open /proc/PID/status, read the
> "Highpid" field, and confirm that it has the expected value.  If the
> pid has been reused, then highpid will be different.
> 
> The initial implementation is straightforward: highpid is simply a
> 64-bit counter. If a high-end system can fork every 3 ns (which
> would be amazing, given that just allocating a pid requires at
> atomic operation), it would take well over 1000 years for highpid to
> wrap.
> 
> For CRIU's benefit, the next highpid can be set by a privileged
> user.
> 
> NB: The sysctl stuff only works on 64-bit systems.  If the approach
> looks good, I'll fix that somehow.
> 
> Signed-off-by: Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>
> ---
> 
> If this goes in, there's plenty of room to add new interfaces to
> make this more useful.  For example, we could add a fancier tgkill
> that adds and validates hightgid and highpid, and we might want to
> add a syscall to read one's own hightgid and highpid.  These would
> be quite useful for pidfiles.
> 
> David, would this be useful for kdbus?
> 
> CRIU people: will this be unduly difficult to support in CRIU?
> 
> If you all think this is a good idea, I'll fix the sysctl stuff and
> document it.  It might also be worth adding "Hightgid" to status.
> 
>  fs/proc/array.c               |  5 ++++-
>  include/linux/pid.h           |  2 ++
>  include/linux/pid_namespace.h |  1 +
>  kernel/pid.c                  | 19 +++++++++++++++----
>  kernel/pid_namespace.c        | 22 ++++++++++++++++++++++
>  5 files changed, 44 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index cd3653e4f35c..f1e0e69d19f9 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -159,6 +159,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>  	int g;
>  	struct fdtable *fdt = NULL;
>  	const struct cred *cred;
> +	const struct upid *upid;
>  	pid_t ppid, tpid;
>  
>  	rcu_read_lock();
> @@ -170,12 +171,14 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>  		if (tracer)
>  			tpid = task_pid_nr_ns(tracer, ns);
>  	}
> +	upid = pid_upid_ns(pid, ns);
>  	cred = get_task_cred(p);
>  	seq_printf(m,
>  		"State:\t%s\n"
>  		"Tgid:\t%d\n"
>  		"Ngid:\t%d\n"
>  		"Pid:\t%d\n"
> +		"Highpid:\t%llu\n"
>  		"PPid:\t%d\n"
>  		"TracerPid:\t%d\n"
>  		"Uid:\t%d\t%d\t%d\t%d\n"

Changing existing proc files like this is dangerous and can cause lots
of breakage in userspace programs if you are not careful.  Usually
adding fields to the end of the file is best, but sometimes even then
things can break :(

^ permalink raw reply

* Re: [RFC PATCH] proc, pidns: Add highpid
From: Eric W. Biederman @ 2014-11-29  3:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Herrmann, linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	systemd Mailing List, criu-GEFAQzZX7r8dnm+yROfE0A,
	Pavel Emelyanov, Cyrill Gorcunov
In-Reply-To: <b0f6c4df0e8ef8afcc7b786edecb4be8c752941e.1417215468.git.luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org>

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> Pid reuse is common, which means that it's difficult or impossible
> to read information about a pid from /proc without races.

Sigh.

What we need are not race free pids, but a file descriptor based process
management api.  Possibly one that starts by handing you a proc
directory.

Which probably means that we need a proc file we can write to and send
signals to a process, and another proc file we can select on and wait
for the process to exit.

Making pids bigger just looks like bandaid.

Remember evovle things in the direction of an object capability system
things wind up being more maintainable.

Eric

^ permalink raw reply

* Re: [RFC PATCH] proc, pidns: Add highpid
From: Andy Lutomirski @ 2014-11-28 23:06 UTC (permalink / raw)
  To: David Herrmann, Linux API, linux-kernel@vger.kernel.org,
	Eric W. Biederman, criu, Pavel Emelyanov, Cyrill Gorcunov
  Cc: Andrew Morton, systemd Mailing List, Andy Lutomirski
In-Reply-To: <b0f6c4df0e8ef8afcc7b786edecb4be8c752941e.1417215468.git.luto@amacapital.net>

[Adding CRIU people.  Whoops.]

On Fri, Nov 28, 2014 at 3:05 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Pid reuse is common, which means that it's difficult or impossible
> to read information about a pid from /proc without races.
>
> This introduces a second number associated with each (task, pidns)
> pair called highpid.  Highpid is a 64-bit number, and, barring
> extremely unlikely circumstances or outright error, a (highpid, pid)
> will never be reused.
>
> With just this change, a program can open /proc/PID/status, read the
> "Highpid" field, and confirm that it has the expected value.  If the
> pid has been reused, then highpid will be different.
>
> The initial implementation is straightforward: highpid is simply a
> 64-bit counter. If a high-end system can fork every 3 ns (which
> would be amazing, given that just allocating a pid requires at
> atomic operation), it would take well over 1000 years for highpid to
> wrap.
>
> For CRIU's benefit, the next highpid can be set by a privileged
> user.
>
> NB: The sysctl stuff only works on 64-bit systems.  If the approach
> looks good, I'll fix that somehow.
>
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>
> If this goes in, there's plenty of room to add new interfaces to
> make this more useful.  For example, we could add a fancier tgkill
> that adds and validates hightgid and highpid, and we might want to
> add a syscall to read one's own hightgid and highpid.  These would
> be quite useful for pidfiles.
>
> David, would this be useful for kdbus?
>
> CRIU people: will this be unduly difficult to support in CRIU?
>
> If you all think this is a good idea, I'll fix the sysctl stuff and
> document it.  It might also be worth adding "Hightgid" to status.
>
>  fs/proc/array.c               |  5 ++++-
>  include/linux/pid.h           |  2 ++
>  include/linux/pid_namespace.h |  1 +
>  kernel/pid.c                  | 19 +++++++++++++++----
>  kernel/pid_namespace.c        | 22 ++++++++++++++++++++++
>  5 files changed, 44 insertions(+), 5 deletions(-)
>
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index cd3653e4f35c..f1e0e69d19f9 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -159,6 +159,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>         int g;
>         struct fdtable *fdt = NULL;
>         const struct cred *cred;
> +       const struct upid *upid;
>         pid_t ppid, tpid;
>
>         rcu_read_lock();
> @@ -170,12 +171,14 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>                 if (tracer)
>                         tpid = task_pid_nr_ns(tracer, ns);
>         }
> +       upid = pid_upid_ns(pid, ns);
>         cred = get_task_cred(p);
>         seq_printf(m,
>                 "State:\t%s\n"
>                 "Tgid:\t%d\n"
>                 "Ngid:\t%d\n"
>                 "Pid:\t%d\n"
> +               "Highpid:\t%llu\n"
>                 "PPid:\t%d\n"
>                 "TracerPid:\t%d\n"
>                 "Uid:\t%d\t%d\t%d\t%d\n"
> @@ -183,7 +186,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>                 get_task_state(p),
>                 task_tgid_nr_ns(p, ns),
>                 task_numa_group_id(p),
> -               pid_nr_ns(pid, ns),
> +               upid ? upid->nr : 0, upid ? upid->highnr : 0,
>                 ppid, tpid,
>                 from_kuid_munged(user_ns, cred->uid),
>                 from_kuid_munged(user_ns, cred->euid),
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 23705a53abba..ece70b64d04c 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -51,6 +51,7 @@ struct upid {
>         /* Try to keep pid_chain in the same cacheline as nr for find_vpid */
>         int nr;
>         struct pid_namespace *ns;
> +       u64 highnr;
>         struct hlist_node pid_chain;
>  };
>
> @@ -170,6 +171,7 @@ static inline pid_t pid_nr(struct pid *pid)
>  }
>
>  pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns);
> +const struct upid *pid_upid_ns(struct pid *pid, struct pid_namespace *ns);
>  pid_t pid_vnr(struct pid *pid);
>
>  #define do_each_pid_task(pid, type, task)                              \
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 1997ffc295a7..1f9f0455d7ef 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -26,6 +26,7 @@ struct pid_namespace {
>         struct rcu_head rcu;
>         int last_pid;
>         unsigned int nr_hashed;
> +       atomic64_t next_highpid;
>         struct task_struct *child_reaper;
>         struct kmem_cache *pid_cachep;
>         unsigned int level;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 9b9a26698144..863d11a9bfbf 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -312,6 +312,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>
>                 pid->numbers[i].nr = nr;
>                 pid->numbers[i].ns = tmp;
> +               pid->numbers[i].highnr =
> +                       atomic64_inc_return(&tmp->next_highpid) - 1;
>                 tmp = tmp->parent;
>         }
>
> @@ -492,17 +494,26 @@ struct pid *find_get_pid(pid_t nr)
>  }
>  EXPORT_SYMBOL_GPL(find_get_pid);
>
> -pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
> +const struct upid *pid_upid_ns(struct pid *pid, struct pid_namespace *ns)
>  {
>         struct upid *upid;
> -       pid_t nr = 0;
>
>         if (pid && ns->level <= pid->level) {
>                 upid = &pid->numbers[ns->level];
>                 if (upid->ns == ns)
> -                       nr = upid->nr;
> +                       return upid;
>         }
> -       return nr;
> +       return NULL;
> +}
> +EXPORT_SYMBOL_GPL(pid_upid_ns);
> +
> +pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
> +{
> +       const struct upid *upid = pid_upid_ns(pid, ns);
> +
> +       if (!upid)
> +               return 0;
> +       return upid->nr;
>  }
>  EXPORT_SYMBOL_GPL(pid_nr_ns);
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index db95d8eb761b..e246802b4361 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -268,6 +268,22 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
>         return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
>  }
>
> +static int pid_ns_next_highpid_handler(struct ctl_table *table, int write,
> +               void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +       struct pid_namespace *pid_ns = task_active_pid_ns(current);
> +       struct ctl_table tmp = *table;
> +
> +       if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
> +               return -EPERM;
> +
> +       /* This needs to be fixed. */
> +       BUILD_BUG_ON(sizeof(u64) != sizeof(unsigned long));
> +
> +       tmp.data = &pid_ns->next_highpid;
> +       return proc_dointvec(&tmp, write, buffer, lenp, ppos);
> +}
> +
>  extern int pid_max;
>  static int zero = 0;
>  static struct ctl_table pid_ns_ctl_table[] = {
> @@ -279,6 +295,12 @@ static struct ctl_table pid_ns_ctl_table[] = {
>                 .extra1 = &zero,
>                 .extra2 = &pid_max,
>         },
> +       {
> +               .procname = "ns_next_highpid",
> +               .maxlen = sizeof(u64),
> +               .mode = 0666, /* permissions are checked in the handler */
> +               .proc_handler = pid_ns_next_highpid_handler,
> +       },
>         { }
>  };
>  static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
> --
> 1.9.3
>



-- 
Andy Lutomirski
AMA Capital Management, LLC
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel

^ permalink raw reply

* [RFC PATCH] proc, pidns: Add highpid
From: Andy Lutomirski @ 2014-11-28 23:05 UTC (permalink / raw)
  To: David Herrmann, linux-api, linux-kernel, Eric W. Biederman
  Cc: Andrew Morton, systemd Mailing List, Andy Lutomirski

Pid reuse is common, which means that it's difficult or impossible
to read information about a pid from /proc without races.

This introduces a second number associated with each (task, pidns)
pair called highpid.  Highpid is a 64-bit number, and, barring
extremely unlikely circumstances or outright error, a (highpid, pid)
will never be reused.

With just this change, a program can open /proc/PID/status, read the
"Highpid" field, and confirm that it has the expected value.  If the
pid has been reused, then highpid will be different.

The initial implementation is straightforward: highpid is simply a
64-bit counter. If a high-end system can fork every 3 ns (which
would be amazing, given that just allocating a pid requires at
atomic operation), it would take well over 1000 years for highpid to
wrap.

For CRIU's benefit, the next highpid can be set by a privileged
user.

NB: The sysctl stuff only works on 64-bit systems.  If the approach
looks good, I'll fix that somehow.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---

If this goes in, there's plenty of room to add new interfaces to
make this more useful.  For example, we could add a fancier tgkill
that adds and validates hightgid and highpid, and we might want to
add a syscall to read one's own hightgid and highpid.  These would
be quite useful for pidfiles.

David, would this be useful for kdbus?

CRIU people: will this be unduly difficult to support in CRIU?

If you all think this is a good idea, I'll fix the sysctl stuff and
document it.  It might also be worth adding "Hightgid" to status.

 fs/proc/array.c               |  5 ++++-
 include/linux/pid.h           |  2 ++
 include/linux/pid_namespace.h |  1 +
 kernel/pid.c                  | 19 +++++++++++++++----
 kernel/pid_namespace.c        | 22 ++++++++++++++++++++++
 5 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index cd3653e4f35c..f1e0e69d19f9 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -159,6 +159,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 	int g;
 	struct fdtable *fdt = NULL;
 	const struct cred *cred;
+	const struct upid *upid;
 	pid_t ppid, tpid;
 
 	rcu_read_lock();
@@ -170,12 +171,14 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		if (tracer)
 			tpid = task_pid_nr_ns(tracer, ns);
 	}
+	upid = pid_upid_ns(pid, ns);
 	cred = get_task_cred(p);
 	seq_printf(m,
 		"State:\t%s\n"
 		"Tgid:\t%d\n"
 		"Ngid:\t%d\n"
 		"Pid:\t%d\n"
+		"Highpid:\t%llu\n"
 		"PPid:\t%d\n"
 		"TracerPid:\t%d\n"
 		"Uid:\t%d\t%d\t%d\t%d\n"
@@ -183,7 +186,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 		get_task_state(p),
 		task_tgid_nr_ns(p, ns),
 		task_numa_group_id(p),
-		pid_nr_ns(pid, ns),
+		upid ? upid->nr : 0, upid ? upid->highnr : 0,
 		ppid, tpid,
 		from_kuid_munged(user_ns, cred->uid),
 		from_kuid_munged(user_ns, cred->euid),
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 23705a53abba..ece70b64d04c 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -51,6 +51,7 @@ struct upid {
 	/* Try to keep pid_chain in the same cacheline as nr for find_vpid */
 	int nr;
 	struct pid_namespace *ns;
+	u64 highnr;
 	struct hlist_node pid_chain;
 };
 
@@ -170,6 +171,7 @@ static inline pid_t pid_nr(struct pid *pid)
 }
 
 pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns);
+const struct upid *pid_upid_ns(struct pid *pid, struct pid_namespace *ns);
 pid_t pid_vnr(struct pid *pid);
 
 #define do_each_pid_task(pid, type, task)				\
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 1997ffc295a7..1f9f0455d7ef 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -26,6 +26,7 @@ struct pid_namespace {
 	struct rcu_head rcu;
 	int last_pid;
 	unsigned int nr_hashed;
+	atomic64_t next_highpid;
 	struct task_struct *child_reaper;
 	struct kmem_cache *pid_cachep;
 	unsigned int level;
diff --git a/kernel/pid.c b/kernel/pid.c
index 9b9a26698144..863d11a9bfbf 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -312,6 +312,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 		pid->numbers[i].nr = nr;
 		pid->numbers[i].ns = tmp;
+		pid->numbers[i].highnr =
+			atomic64_inc_return(&tmp->next_highpid) - 1;
 		tmp = tmp->parent;
 	}
 
@@ -492,17 +494,26 @@ struct pid *find_get_pid(pid_t nr)
 }
 EXPORT_SYMBOL_GPL(find_get_pid);
 
-pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
+const struct upid *pid_upid_ns(struct pid *pid, struct pid_namespace *ns)
 {
 	struct upid *upid;
-	pid_t nr = 0;
 
 	if (pid && ns->level <= pid->level) {
 		upid = &pid->numbers[ns->level];
 		if (upid->ns == ns)
-			nr = upid->nr;
+			return upid;
 	}
-	return nr;
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(pid_upid_ns);
+
+pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
+{
+	const struct upid *upid = pid_upid_ns(pid, ns);
+
+	if (!upid)
+		return 0;
+	return upid->nr;
 }
 EXPORT_SYMBOL_GPL(pid_nr_ns);
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index db95d8eb761b..e246802b4361 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -268,6 +268,22 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
 	return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
 }
 
+static int pid_ns_next_highpid_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct pid_namespace *pid_ns = task_active_pid_ns(current);
+	struct ctl_table tmp = *table;
+
+	if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN))
+		return -EPERM;
+
+	/* This needs to be fixed. */
+	BUILD_BUG_ON(sizeof(u64) != sizeof(unsigned long));
+
+	tmp.data = &pid_ns->next_highpid;
+	return proc_dointvec(&tmp, write, buffer, lenp, ppos);
+}
+
 extern int pid_max;
 static int zero = 0;
 static struct ctl_table pid_ns_ctl_table[] = {
@@ -279,6 +295,12 @@ static struct ctl_table pid_ns_ctl_table[] = {
 		.extra1 = &zero,
 		.extra2 = &pid_max,
 	},
+	{
+		.procname = "ns_next_highpid",
+		.maxlen = sizeof(u64),
+		.mode = 0666, /* permissions are checked in the handler */
+		.proc_handler = pid_ns_next_highpid_handler,
+	},
 	{ }
 };
 static struct ctl_path kern_path[] = { { .procname = "kernel", }, { } };
-- 
1.9.3

_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel

^ permalink raw reply related

* [RFC PATCH] userns: Disallow setgroups unless the gid_map writer is privileged
From: Andy Lutomirski @ 2014-11-28 22:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linux Containers, Josh Triplett, Andrew Morton, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel@vger.kernel.org, LSM, Casey Schaufler,
	Serge E. Hallyn, Richard Weinberger, kenton, Andy Lutomirski,
	stable

Classic unix permission checks have an interesting feature.  The
group permissions for a file can be set to less than the other
permissions on a file.  Occasionally this is used deliberately to
give a certain group of users fewer permissions than the default.

User namespaces break this usage.  Groups set in rgid or egid are
unaffected because an unprivileged user namespace creator can only
map a single group, so setresgid inside and outside the namespace
have the same effect.  However, an unprivileged user namespace
creator can currently use setgroups(2) to drop all supplementary
groups, so, if a supplementary group denies access to some resource,
user namespaces can be used to bypass that restriction.

To fix this issue without unduly breaking existing user code, this
introduces a new user namespace flag USERNS_SETGROUPS_ALLOWED.  If
that flag is not set, then setgroups(2) will fail regardless of the
caller's capabilities.

USERNS_SETGROUPS_ALLOWED is cleared in a new user namespace.  By
default, if the writer of gid_map has CAP_SETGID in the parent
userns and the parent userns has USERNS_SETGROUPS_ALLOWED, then the
USERNS_SETGROUPS_ALLOWED will be set in the child.

While it could be safe to set USERNS_SETGROUPS_ALLOWED if the user
namespace creator has no supplementary groups, doing so could be
surprising and could have unpleasant interactions with setns(2).

The default behavior can be overridden by adding a line of the form
"setgroups allow" or "setgroups deny" to gid_map.  The former will
return -EPERM if the caller is not permitted to set
USERNS_SETGROUPS_ALLOWED, and the latter will clear
USERNS_SETGROUPS_ALLOWED even if the flag would otherwise be set.

This change should break userspace by the minimal amount needed to
fix this issue.  Any application that uses newgidmap(1) should be
unaffected by this fix, and unprivileged users that create user
namespaces to manipulate mounts or sandbox themselves will also be
unaffected unless they try to use setgroups(2).

This should fix CVE-2014-8989.

Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---

Eric, this is an alternative to your patch.  I think it will cause
less breakage, and it will keep unprivileged user namespaces
more or less fully functional.

Kenton, I think that neither run-bundle nor supervisor-main will be
broken by this patch.

 include/linux/user_namespace.h |  3 +++
 kernel/groups.c                |  3 +++
 kernel/user.c                  |  1 +
 kernel/user_namespace.c        | 36 ++++++++++++++++++++++++++++++++++++
 4 files changed, 43 insertions(+)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index e95372654f09..a74c1f3d44fe 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -17,6 +17,8 @@ struct uid_gid_map {	/* 64 bytes -- 1 cache line */
 	} extent[UID_GID_MAP_MAX_EXTENTS];
 };
 
+#define USERNS_SETGROUPS_ALLOWED 1
+
 struct user_namespace {
 	struct uid_gid_map	uid_map;
 	struct uid_gid_map	gid_map;
@@ -27,6 +29,7 @@ struct user_namespace {
 	kuid_t			owner;
 	kgid_t			group;
 	unsigned int		proc_inum;
+	unsigned int		flags;
 
 	/* Register of per-UID persistent keyrings for this namespace */
 #ifdef CONFIG_PERSISTENT_KEYRINGS
diff --git a/kernel/groups.c b/kernel/groups.c
index 451698f86cfa..e27433809978 100644
--- a/kernel/groups.c
+++ b/kernel/groups.c
@@ -6,6 +6,7 @@
 #include <linux/slab.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
+#include <linux/user_namespace.h>
 #include <asm/uaccess.h>
 
 /* init to 2 - one for init_task, one to ensure it is never freed */
@@ -223,6 +224,8 @@ SYSCALL_DEFINE2(setgroups, int, gidsetsize, gid_t __user *, grouplist)
 	struct group_info *group_info;
 	int retval;
 
+	if (!(current_user_ns()->flags & USERNS_SETGROUPS_ALLOWED))
+		return -EPERM;
 	if (!ns_capable(current_user_ns(), CAP_SETGID))
 		return -EPERM;
 	if ((unsigned)gidsetsize > NGROUPS_MAX)
diff --git a/kernel/user.c b/kernel/user.c
index 4efa39350e44..f8cdb1ec6049 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -51,6 +51,7 @@ struct user_namespace init_user_ns = {
 	.owner = GLOBAL_ROOT_UID,
 	.group = GLOBAL_ROOT_GID,
 	.proc_inum = PROC_USER_INIT_INO,
+	.flags = USERNS_SETGROUPS_ALLOWED,
 #ifdef CONFIG_PERSISTENT_KEYRINGS
 	.persistent_keyring_register_sem =
 	__RWSEM_INITIALIZER(init_user_ns.persistent_keyring_register_sem),
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index aa312b0dc3ec..6e7b9ee5bddc 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -600,6 +600,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	unsigned long page = 0;
 	char *kbuf, *pos, *next_line;
 	ssize_t ret = -EINVAL;
+	unsigned int gid_flags = 0;
+	bool seen_explicit_gid_flag = false;
 
 	/*
 	 * The id_map_mutex serializes all writes to any given map.
@@ -633,6 +635,19 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 	if (cap_valid(cap_setid) && !file_ns_capable(file, ns, CAP_SYS_ADMIN))
 		goto out;
 
+	/* Deal with supplementary groups. */
+	if (map == &ns->gid_map) {
+		/*
+		 * By default, setgroups is allowed inside the userns
+		 * if the writer has no supplementary groups (making
+		 * it useless) or if the writer is privileged.
+		 */
+		if ((ns->parent->flags & USERNS_SETGROUPS_ALLOWED) &&
+		    file_ns_capable(file, ns->parent, CAP_SETGID) &&
+		    ns_capable(ns->parent, CAP_SETGID))
+			gid_flags = USERNS_SETGROUPS_ALLOWED;
+	}
+
 	/* Get a buffer */
 	ret = -ENOMEM;
 	page = __get_free_page(GFP_TEMPORARY);
@@ -667,6 +682,25 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 				next_line = NULL;
 		}
 
+		/* Is this line a gid_map option? */
+		if (map == &ns->gid_map) {
+			if (!strcmp(pos, "setgroups deny")) {
+				if (seen_explicit_gid_flag)
+					goto out;
+				seen_explicit_gid_flag = 1;
+				gid_flags = 0;
+				continue;
+			} else if (!strcmp(pos, "setgroups allow")) {
+				if (seen_explicit_gid_flag)
+					goto out;
+				if (!(gid_flags & USERNS_SETGROUPS_ALLOWED)) {
+					ret = -EPERM;
+					goto out;
+				}
+				continue;
+			}
+		}
+
 		pos = skip_spaces(pos);
 		extent->first = simple_strtoul(pos, &pos, 10);
 		if (!isspace(*pos))
@@ -746,6 +780,8 @@ static ssize_t map_write(struct file *file, const char __user *buf,
 		new_map.nr_extents*sizeof(new_map.extent[0]));
 	smp_wmb();
 	map->nr_extents = new_map.nr_extents;
+	if (map == &ns->gid_map)
+		ns->flags |= gid_flags;
 
 	*ppos = count;
 	ret = count;
-- 
1.9.3


^ permalink raw reply related

* Re: [PATCH v3 3/5] arm64: dts: Add support for Spreadtrum SC9836 SoC in dts and Makefile
From: Catalin Marinas @ 2014-11-28 17:24 UTC (permalink / raw)
  To: Mark Rutland
  Cc: andrew-g2DYL2Zd6BY@public.gmane.org,
	heiko-4mtYJXux2i+zQB+pC5nmwQ@public.gmane.org,
	gnomes-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org,
	Will Deacon, Chunyan Zhang,
	jason-NLaQJdtUoK4Be96aLqz0jA@public.gmane.org,
	lanqing.liu-lxIno14LUO0EEoCn2XhGlw@public.gmane.org,
	arnd-r2nGTMty4D4@public.gmane.org,
	corbet-T1hC0tSOHrs@public.gmane.org,
	zhang.lyra-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	zhizhou.zhang-lxIno14LUO0EEoCn2XhGlw@public.gmane.org,
	geng.ren-lxIno14LUO0EEoCn2XhGlw@public.gmane.org,
	m-karicheri2-l0cyMroinI0@public.gmane.org,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
	linux-serial-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	grant.likely-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org,
	orsonzhai-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
	"florian.vaussard-p8DiymsW2f8@public.gmane.org" <florian.vaussard>
In-Reply-To: <20141128164047.GL25883@leverpostej>

On Fri, Nov 28, 2014 at 04:40:47PM +0000, Mark Rutland wrote:
> [...]
> 
> > In the meantime I think we can be more tolerant:
> > 
> > diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
> > index 2133f9d59d06..87f67a93fcc7 100644
> > --- a/drivers/clocksource/arm_arch_timer.c
> > +++ b/drivers/clocksource/arm_arch_timer.c
> > @@ -376,6 +376,8 @@ arch_timer_detect_rate(void __iomem *cntbase, struct device_node *np)
> >  			arch_timer_rate = readl_relaxed(cntbase + CNTFRQ);
> >  		else
> >  			arch_timer_rate = arch_timer_get_cntfrq();
> > +	} else if (IS_ENABLED(CONFIG_ARM64)) {
> > +		pr_warn("Architected timer frequency overridden by DT (broken firmware?)\n");
> >  	}
> 
> That looks good to me. On a related note, it looks like my documentation
> patch [1] from a while back fell by the wayside. Does anyone fancy
> picking that up to go with this?

I can put the two together (keeping you as author). Do we still need the
IS_ENABLED(CONFIG_ARM64) part here or we extend the warning to to arm32
as well?

-- 
Catalin
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [tpmdd-devel] [PATCH v7 08/10] tpm: TPM 2.0 CRB Interface
From: Stefan Berger @ 2014-11-28 17:23 UTC (permalink / raw)
  To: Jarkko Sakkinen
  Cc: Peter Huewe, Ashley Lai, Marcel Selhorst,
	christophe.ricard-Re5JQEeQqe8AvxtiuMwx3w,
	josh.triplett-ral2JQCrhuEAvxtiuMwx3w,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	tpmdd-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	jason.gunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	trousers-tech-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f
In-Reply-To: <20141127154023.GD24791-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

On 11/27/2014 10:40 AM, Jarkko Sakkinen wrote:
> On Wed, Nov 26, 2014 at 09:06:57AM -0500, Stefan Berger wrote:
>> On 11/11/2014 08:45 AM, Jarkko Sakkinen wrote:
>>> tpm_crb is a driver for TPM 2.0 Command Response Buffer (CRB) Interface
>>> as defined in PC Client Platform TPM Profile (PTP) Specification.
>>>
>>> Only polling and single locality is supported as these are the limitations
>>> of the available hardware, Platform Trust Techonlogy (PTT) in Haswell
>>> CPUs.
>>>
>>> The driver always applies CRB with ACPI start because PTT reports using
>>> only ACPI start as start method but as a result of my testing it requires
>>> also CRB start.
>>>
>>> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
>>> ---
>>>
[...]
>>> +static int crb_send(struct tpm_chip *chip, u8 *buf, size_t len)
>>> +{
>>> +	struct crb_priv *priv = chip->vendor.priv;
>>> +	struct crb_control_area *cca;
>>> +	u8 *cmd;
>>> +	int rc = 0;
>>> +
>>> +	cca = priv->cca;
>>> +
>>> +	if (len > le32_to_cpu(cca->cmd_size)) {
>>> +		dev_err(&chip->dev,
>>> +			"invalid command count value %x %zx\n",
>>> +			(unsigned int) len,
>>> +			(size_t) le32_to_cpu(cca->cmd_size));
>>> +		return -E2BIG;
>>> +	}
>>> +
>>> +	cmd = (u8 *) ((unsigned long) cca + le64_to_cpu(cca->cmd_pa) -
>>> +		      priv->cca_pa);
>> cca = priv->cca per statement above     -> cmd = cca + x - cca = x
>>
>> -> cmd = le64_to_cpu(cca->cmd_pa);
>>
>> Should do the trick, no ?
> Virtual address might be different where CCA is ioremapped.

My bad. Is the 'Command Address' then guaranteed to lie within the 
remapped area of the 'Control Area' or should there be a check whether 
you would need a separate remapping? It's not clear from the specs where 
it lies (2.1 4th paragraph: "They can all be in system RAM, or all be in 
device memory, or any combination"). Same is true for the 'Response 
Address'.  Maybe you should have the ioremapped address for the 'Command 
Address' calculated elsewhere and simplify the address calculation here 
to basically cmd = priv->command_address.

Make the address calculations in cbr_recv and crb_send look the same.

[So, (cca - priv->cca_pa)  calculates the offset of the remapping. 
Adding to this the address of the command buffer gives the ioremapped 
address of the command buffer, assuming it lies within that remapped area. ]


     Stefan

>
>>> +	memcpy(cmd, buf, len);
>>> +
>>> +	/* Make sure that cmd is populated before issuing start. */
>>> +	wmb();
>>> +
>>> +	cca->start = cpu_to_le32(1);
>>> +	rc = crb_do_acpi_start(chip);

^ permalink raw reply

* Re: [CFT][PATCH] userns: Avoid problems with negative groups
From: Andy Lutomirski @ 2014-11-28 17:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-man, Kees Cook, Linux API, Linux Containers, Serge Hallyn,
	Josh Triplett,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, LSM,
	Michael Kerrisk-manpages, Richard Weinberger, Casey Schaufler,
	Andrew Morton
In-Reply-To: <874mtjp9m1.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>

cc: Serge and Stephane, since this may end up affecting LXC.

On Fri, Nov 28, 2014 at 8:34 AM, Eric W. Biederman
<ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>
>> On Thu, Nov 27, 2014 at 9:21 PM, Eric W. Biederman
>> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>>
>>>>> This change should break userspace by the minimal amount needed
>>>>> to fix this issue.
>>>>>
>>>>> This should fix CVE-2014-8989.
>>>>
>>>> I think this is both unnecessarily restrictive and that it doesn't fix
>>>> the bug.
>>>
>>> You are going to have to work very hard to convince me this is
>>> unnecessarily restrictive.
>>>
>>>>For example, I can exploit CVE-2014-8989 without ever
>>>> writing a uid map or a gid map.
>>>
>>> Yes.  I realized just after I sent the patch that setgroups(0, NULL)
>>> would still work without a mapping set.  That is a first glass grade a
>>> oversight that resulted in a bug.  None of the other uid or gid changing
>>> syscalls without a mapping set, and setgroups was just overlooked
>>> because it was different.  Oops.
>>>
>>> I will send an updated patch that stops setgroups from working without
>>> a mapping set shortly.
>>>
>>>> IIUC, the only real issue is that user namespaces allow groups to be
>>>> dropped using setgroups that wouldn't otherwise be dropped.  Can we
>>>> get away with adding a per-user-ns flag that determines whether
>>>> setgroups can be used?
>>>
>>> Being able to call setgroups is fundamental to login programs, and login
>>> programs are one of the things user namespaces need to support.  So
>>> adding an extra flag and an extra place where privilege is required
>>> is just noise, and will wind up breaking every user of user namespaces.
>>>
>>> Further being able to setup uid and gid mappings without privilege is
>>> primarily a nice to have.  The original design did not have unprivileged
>>> setting of uid and gid maps and if it proves insecure I goofed and the
>>> feature isn't safe so it needs to be removed.
>>
>> Being able to set a single-user uid map and gid map is very useful for
>> sandboxing.  This lets unprivileged users drop filesystem and network
>> access and still run most normal programs.  A surprising number of
>> normal unprivileged programs fail if run without a mapping.
>
> You can still set a single uid map unprivileged.  That should be enough
> to keep your capabilities inside the namespace as long as you need them.
>
> I am sad that in practice you can't set a single gid map, as everyone
> calls setgroups.
>

That's not the problem.  The problem is that a surprising number of
libraries expect getegid(), etc to return sensible values.

> Although I sort of think it might be worth scouring userspace for
> something that will call setgroups and drop all of your groups.  If we
> can find something preexisting that will solve this entire mess in a
> much more elegant way.
>
>>> This does mean that running a system with negative groups and users
>>> delegated subordinate gids in /etc/subuid is a bad idea and system
>>> administrators shouldn't do that as those negative groups won't prove
>>> effective in stopping their users.  But this is all under system
>>> administrator control so shrug.  There isn't a way to avoid that
>>> fundamental conflict.
>>>
>>>> setgroups would be unusable until the gid_map has been written and
>>>> then it would become usable if and only if the parent userns could use
>>>> setgroups and the opener of gid_map was privileged.
>>>
>>> That proposal sounds a lot more restrictive and a lot more of a pain
>>> to use than what I have implemented in my patch.
>>>
>>>> If we wanted to allow finer-grained control, we could allow writing
>>>> control lines like:
>>>>
>>>> options +setgroups
>>>>
>>>> or
>>>>
>>>> options -setgroups
>>>>
>>>> in gid_map, or we could add user_ns_flags that can only be written
>>>> once and only before either uid_map or gid_map is written.
>>>
>>> Definitely more complicated and I can't imagine a case where I need
>>> a gid map without needing to call setgroups.
>>
>> I do it all the time.  Unshare, set mappings (with no inner uid 0 at
>> all), set no_new_privs, drop caps, and go.
>>
>> Can we try the intermediate approach?  If you set gid_map without
>> privilege and you have supplementary groups, then let the write to
>> gid_map succeed but prevent setgroups from ever working?  That should
>> only be a couple of lines of code longer than your patch, and it will
>> avoid breaking sandbox use cases.
>
> I am torn.  Send me an incremental patch and I will be happy to evaluate
> it and if all is good fold the change in.  I hate breaking userspace but
> if I break it I would rather it be a clean break that is easy to spot
> and work around rather than something that almost works, and causes
> people a lot of difficulty debugging.

I'll play with it this afternoon or over the weekend.  I'm currently
on vacation, so I'll be a little slow.

>
> My expectation is that systems that are serious will have /etc/subuid
> and /etc/subgid and newuidmap and newgidmap setup.  Which is the other
> way to allow you to map your own gid.
>
>> If we want to get really fancy in the future, we could have a concept
>> of pinned groups.  That is, if you're in a userns and you're a member
>> of an unmapped group, then you can't drop that group.  (Actually, that
>> all by itself might be enough to fix this issue.)
>
> Not allowing you to drop groups really isn't enough.  One of the first
> things applications like ssh do is call setgroups(0, NULL) to drop the
> privileges granted by supplementary groups.  Further login program
> somewhere call setgroups(N, ....) and give you every group mapped
> by /etc/group.
>
> I don't think I want to run containers with every supplementary group I
> might want to drop mapped, and more than that, it would require changing
> a lot more userspace than just the userspace that just does unpriv
> containers with a single uid, and a single gid mapped.
>
> But please test and see if you really need to map your group, and send
> me an incremental patch if you see a way to do better.  Breaking
> userspace sucks.

I *know* I need the uid mapped, and I'm pretty sure I need the gid as
well.  FWIW, the code I care about won't object too strongly to a
one-time break, but it will object to needing to use subuids, since it
will have all kinds of problems if it starts to need to rely on setuid
helpers.

There's a third option: use your patch but require explicit userspace
opt-in for code that wants the setgroups-not-allowed mode.  I'll try
implementing that.

--Andy

^ permalink raw reply

* Re: [PATCH v3 3/5] arm64: dts: Add support for Spreadtrum SC9836 SoC in dts and Makefile
From: Mark Rutland @ 2014-11-28 16:40 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: andrew@lunn.ch, heiko@sntech.de, gnomes@lxorguk.ukuu.org.uk,
	Will Deacon, Chunyan Zhang, jason@lakedaemon.net,
	lanqing.liu@spreadtrum.com, arnd@arndb.de, corbet@lwn.net,
	zhang.lyra@gmail.com, zhizhou.zhang@spreadtrum.com,
	geng.ren@spreadtrum.com, m-karicheri2@ti.com,
	linux-arm-kernel@lists.infradead.org,
	linux-serial@vger.kernel.org, grant.likely@linaro.org,
	orsonzhai@gmail.com, florian.vaussard@epfl.ch
In-Reply-To: <20141128150325.GC24370@e104818-lin.cambridge.arm.com>

[...]

> In the meantime I think we can be more tolerant:
> 
> diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
> index 2133f9d59d06..87f67a93fcc7 100644
> --- a/drivers/clocksource/arm_arch_timer.c
> +++ b/drivers/clocksource/arm_arch_timer.c
> @@ -376,6 +376,8 @@ arch_timer_detect_rate(void __iomem *cntbase, struct device_node *np)
>  			arch_timer_rate = readl_relaxed(cntbase + CNTFRQ);
>  		else
>  			arch_timer_rate = arch_timer_get_cntfrq();
> +	} else if (IS_ENABLED(CONFIG_ARM64)) {
> +		pr_warn("Architected timer frequency overridden by DT (broken firmware?)\n");
>  	}

That looks good to me. On a related note, it looks like my documentation
patch [1] from a while back fell by the wayside. Does anyone fancy
picking that up to go with this?

Mark.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-August/282830.html

^ permalink raw reply

* Re: [CFT][PATCH] userns: Avoid problems with negative groups
From: Eric W. Biederman @ 2014-11-28 16:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux Containers, Josh Triplett, Andrew Morton, Kees Cook,
	Michael Kerrisk-manpages, Linux API, linux-man,
	linux-kernel@vger.kernel.org, LSM, Casey Schaufler,
	Serge E. Hallyn, Richard Weinberger
In-Reply-To: <CALCETrX2s-7iaLMEKLQsExTEp3JyoAPQG44p0v5wkeED3-6dQA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:

> On Thu, Nov 27, 2014 at 9:21 PM, Eric W. Biederman
> <ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> wrote:
>> Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> writes:
>>
>>>> This change should break userspace by the minimal amount needed
>>>> to fix this issue.
>>>>
>>>> This should fix CVE-2014-8989.
>>>
>>> I think this is both unnecessarily restrictive and that it doesn't fix
>>> the bug.
>>
>> You are going to have to work very hard to convince me this is
>> unnecessarily restrictive.
>>
>>>For example, I can exploit CVE-2014-8989 without ever
>>> writing a uid map or a gid map.
>>
>> Yes.  I realized just after I sent the patch that setgroups(0, NULL)
>> would still work without a mapping set.  That is a first glass grade a
>> oversight that resulted in a bug.  None of the other uid or gid changing
>> syscalls without a mapping set, and setgroups was just overlooked
>> because it was different.  Oops.
>>
>> I will send an updated patch that stops setgroups from working without
>> a mapping set shortly.
>>
>>> IIUC, the only real issue is that user namespaces allow groups to be
>>> dropped using setgroups that wouldn't otherwise be dropped.  Can we
>>> get away with adding a per-user-ns flag that determines whether
>>> setgroups can be used?
>>
>> Being able to call setgroups is fundamental to login programs, and login
>> programs are one of the things user namespaces need to support.  So
>> adding an extra flag and an extra place where privilege is required
>> is just noise, and will wind up breaking every user of user namespaces.
>>
>> Further being able to setup uid and gid mappings without privilege is
>> primarily a nice to have.  The original design did not have unprivileged
>> setting of uid and gid maps and if it proves insecure I goofed and the
>> feature isn't safe so it needs to be removed.
>
> Being able to set a single-user uid map and gid map is very useful for
> sandboxing.  This lets unprivileged users drop filesystem and network
> access and still run most normal programs.  A surprising number of
> normal unprivileged programs fail if run without a mapping.

You can still set a single uid map unprivileged.  That should be enough
to keep your capabilities inside the namespace as long as you need them.

I am sad that in practice you can't set a single gid map, as everyone
calls setgroups.

Although I sort of think it might be worth scouring userspace for
something that will call setgroups and drop all of your groups.  If we
can find something preexisting that will solve this entire mess in a
much more elegant way.

>> This does mean that running a system with negative groups and users
>> delegated subordinate gids in /etc/subuid is a bad idea and system
>> administrators shouldn't do that as those negative groups won't prove
>> effective in stopping their users.  But this is all under system
>> administrator control so shrug.  There isn't a way to avoid that
>> fundamental conflict.
>>
>>> setgroups would be unusable until the gid_map has been written and
>>> then it would become usable if and only if the parent userns could use
>>> setgroups and the opener of gid_map was privileged.
>>
>> That proposal sounds a lot more restrictive and a lot more of a pain
>> to use than what I have implemented in my patch.
>>
>>> If we wanted to allow finer-grained control, we could allow writing
>>> control lines like:
>>>
>>> options +setgroups
>>>
>>> or
>>>
>>> options -setgroups
>>>
>>> in gid_map, or we could add user_ns_flags that can only be written
>>> once and only before either uid_map or gid_map is written.
>>
>> Definitely more complicated and I can't imagine a case where I need
>> a gid map without needing to call setgroups.
>
> I do it all the time.  Unshare, set mappings (with no inner uid 0 at
> all), set no_new_privs, drop caps, and go.
>
> Can we try the intermediate approach?  If you set gid_map without
> privilege and you have supplementary groups, then let the write to
> gid_map succeed but prevent setgroups from ever working?  That should
> only be a couple of lines of code longer than your patch, and it will
> avoid breaking sandbox use cases.

I am torn.  Send me an incremental patch and I will be happy to evaluate
it and if all is good fold the change in.  I hate breaking userspace but
if I break it I would rather it be a clean break that is easy to spot
and work around rather than something that almost works, and causes
people a lot of difficulty debugging.

My expectation is that systems that are serious will have /etc/subuid
and /etc/subgid and newuidmap and newgidmap setup.  Which is the other
way to allow you to map your own gid.

> If we want to get really fancy in the future, we could have a concept
> of pinned groups.  That is, if you're in a userns and you're a member
> of an unmapped group, then you can't drop that group.  (Actually, that
> all by itself might be enough to fix this issue.)

Not allowing you to drop groups really isn't enough.  One of the first
things applications like ssh do is call setgroups(0, NULL) to drop the
privileges granted by supplementary groups.  Further login program
somewhere call setgroups(N, ....) and give you every group mapped
by /etc/group.

I don't think I want to run containers with every supplementary group I
might want to drop mapped, and more than that, it would require changing
a lot more userspace than just the userspace that just does unpriv
containers with a single uid, and a single gid mapped.

But please test and see if you really need to map your group, and send
me an incremental patch if you see a way to do better.  Breaking
userspace sucks.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox