[0/10] User namespaces: introduction

All of lore.kernel.org
 help / color / mirror / Atom feed

* [0/10] User namespaces: introduction
@ 2008-08-22 19:45 Serge E. Hallyn
       [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-22 19:45 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers

Hi Eric,

so here is a start to a userns patchset trying to follow your ideas
about how to have user namespaces and filesystems interact.  Ignore
the bookkeeping crap or you'll pull your hair out.  Lots of stuff
remains unimplemented - i.e. chown (setattr) and proper handling of
capabilities.  But you can do some fun things with this patchset.
I.e.

	(log in as root)
	setcap cap_sys_admin=ep ns_exec
	setcap cap_sys_admin=ep usernsmount
	ns_exec -U /bin/sh
	ls /root (fails)
	ls / (succeeds)
	(log in as hallyn)
	ns_exec -U /bin/sh
	id
		(uid=0, gid=0)
	ls (fails, can't descend /home/hallyn)
	usernsmount / nsid=4
	ls (succeeds)
	touch ab
	ls -l ab
		(ab is owned by root)
	exit
	(we're logged in as hallyn in the init_user_ns again)
	ls -l ab
		(ab is owned by hallyn)

The only supported fs is ext3.  Only a few operations are supported.
So if, above, when we are hallyn in the init_user_ns but root in
the child user ns,
	when we create a file, it is properly handled, so
		inode->i_uid=500, but an xattr (nsid=4,uid=0) is added
	when we chown the file to root, it is not properly handled,
		so inode->i_uid = 0
it's just a matter of hooking all the places at this point.

Capabilities remain a problem.  Right now I think capabilities will
need to be split up into system-wide caps, and container-safe caps.
So CAP_NET_ADMIN, CAP_NET_RAW, CAP_DAC_OVERRIDE, those are container-safe.
CAP_REBOOT may become container-safe one day, but for now is very
much system-wide.

So if I'm uid 500 on the host and create a user namespace where I'm
uid=0, I should be able to acquire container-safe caps (perhaps
contingent on whether I unshared all other namespaces), but not
system-wide ones.  Or, whether I can acquire them would depend
on whether the suid bit was set in a user_ns or not.  sigh.

thanks,
-serge

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 01/10] user namespaces: introduce user_struct->user_namespace relationship
       [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-08-22 19:45   ` Serge E. Hallyn
  2008-08-22 19:45   ` [PATCH 02/10] user namespaces: move user_ns from nsproxy into user struct Serge E. Hallyn
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-22 19:45 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers

When a task does clone(CLONE_NEWNS), the task's user is the 'creator' of the
new user_namespace, and the user_namespace is tacked onto a list of those
created by this user.

Changelog:
	Aug 1: renamed user->user_namespace to user_ns, as the next
		patch did anyway.
	Aug 1: move put_user_ns call in one free_user() definition
		to move it outside the lock in free_user.  put_user_ns
		calls free_user on the user_ns->creator, which in
		turn would grab the lock again.

Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 include/linux/sched.h          |    1 +
 include/linux/user_namespace.h |    1 +
 kernel/user.c                  |    7 +++++++
 kernel/user_namespace.c        |   20 +++++++++++---------
 4 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5850bfb..b0fe15a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -600,6 +600,7 @@ struct user_struct {
 	/* Hash table maintenance information */
 	struct hlist_node uidhash_node;
 	uid_t uid;
+	struct user_namespace *user_ns;
 
 #ifdef CONFIG_USER_SCHED
 	struct task_group *tg;
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index b5f41d4..f9477c3 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -13,6 +13,7 @@ struct user_namespace {
 	struct kref		kref;
 	struct hlist_head	uidhash_table[UIDHASH_SZ];
 	struct user_struct	*root_user;
+	struct user_struct	*creator;
 };
 
 extern struct user_namespace init_user_ns;
diff --git a/kernel/user.c b/kernel/user.c
index 865ecf5..aedb3a1 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -22,6 +22,7 @@ struct user_namespace init_user_ns = {
 		.refcount	= ATOMIC_INIT(2),
 	},
 	.root_user = &root_user,
+	.creator = &root_user,
 };
 EXPORT_SYMBOL_GPL(init_user_ns);
 
@@ -53,6 +54,7 @@ struct user_struct root_user = {
 	.files		= ATOMIC_INIT(0),
 	.sigpending	= ATOMIC_INIT(0),
 	.locked_shm     = 0,
+	.user_ns	= &init_user_ns,
 #ifdef CONFIG_USER_SCHED
 	.tg		= &init_task_group,
 #endif
@@ -325,6 +327,7 @@ static inline void free_user(struct user_struct *up, unsigned long flags)
 	atomic_inc(&up->__count);
 	spin_unlock_irqrestore(&uidhash_lock, flags);
 
+	put_user_ns(up->user_ns);
 	INIT_WORK(&up->work, remove_user_sysfs_dir);
 	schedule_work(&up->work);
 }
@@ -347,6 +350,7 @@ static inline void free_user(struct user_struct *up, unsigned long flags)
 	sched_destroy_user(up);
 	key_put(up->uid_keyring);
 	key_put(up->session_keyring);
+	put_user_ns(up->user_ns);
 	kmem_cache_free(uid_cachep, up);
 }
 
@@ -409,6 +413,8 @@ struct user_struct *alloc_uid(struct user_namespace *ns, uid_t uid)
 		if (sched_create_user(new) < 0)
 			goto out_free_user;
 
+		new->user_ns = get_user_ns(ns);
+
 		if (uids_user_create(new))
 			goto out_destoy_sched;
 
@@ -441,6 +447,7 @@ struct user_struct *alloc_uid(struct user_namespace *ns, uid_t uid)
 
 out_destoy_sched:
 	sched_destroy_user(new);
+	put_user_ns(new->user_ns);
 out_free_user:
 	kmem_cache_free(uid_cachep, new);
 out_unlock:
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index a9ab059..e8db443 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -19,7 +19,6 @@
 static struct user_namespace *clone_user_ns(struct user_namespace *old_ns)
 {
 	struct user_namespace *ns;
-	struct user_struct *new_user;
 	int n;
 
 	ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL);
@@ -38,15 +37,17 @@ static struct user_namespace *clone_user_ns(struct user_namespace *old_ns)
 		return ERR_PTR(-ENOMEM);
 	}
 
-	/* Reset current->user with a new one */
-	new_user = alloc_uid(ns, current->uid);
-	if (!new_user) {
-		free_uid(ns->root_user);
-		kfree(ns);
-		return ERR_PTR(-ENOMEM);
-	}
+	/* pin the creating user */
+	ns->creator = current->user;
+	atomic_inc(&ns->creator->__count);
+
+	/*
+	 * The alloc_uid() incremented the userns refcount,
+	 * so drop it again 
+	 */
+	put_user_ns(ns);
 
-	switch_uid(new_user);
+	switch_uid(ns->root_user);
 	return ns;
 }
 
@@ -72,6 +73,7 @@ void free_user_ns(struct kref *kref)
 
 	ns = container_of(kref, struct user_namespace, kref);
 	release_uids(ns);
+	free_uid(ns->creator);
 	kfree(ns);
 }
 EXPORT_SYMBOL(free_user_ns);
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 02/10] user namespaces: move user_ns from nsproxy into user struct
       [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-08-22 19:45   ` [PATCH 01/10] user namespaces: introduce user_struct->user_namespace relationship Serge E. Hallyn
@ 2008-08-22 19:45   ` Serge E. Hallyn
  2008-08-22 19:45   ` [PATCH 03/10] user namespaces: reset task's credentials on CLONE_NEWUSER Serge E. Hallyn
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-22 19:45 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers

When we get the sysfs support needed to support fair user scheduling
along with user namespaces, then we will need to be able to get the
user namespace from the user struct.

So we need the user_ns to be a part of struct user.  Once we can
access it from tsk->user, we no longer have a use for
tsk->nsproxy->user_ns.

When a user_namespace is created, the user which created it is
marked as its 'creator'.  The user_namespace pins the creator.
Each userid in a user_ns pins the user_ns.  This keeps refcounting
nice and simple.

At the same time, this patch simplifies the refcounting.  The current
user and userns locking works as follows:

	The task pins the user struct.

	The task pins the nsproxy, the nsproxy pins the user_ns.

	When a new user_ns is created, it creates a root user for
	it, and pins it.  When the nsproxy releases the user_ns,
	the userns tries to release all its user structs.

	So you see that the refcounting "works" for now, but only
	because the nsproxy (and therefore usr_ns) and user structs
	will be freed at the same time - when the last task using
	them is released.

	Now we need to put the user_ns in the struct user.  You can
	see that will mess up the refcounting.

	Fortunately, once the user_ns is available from tsk->user,
	we don't need it in nsproxy.

	So here is how the refcounting *should* be done:

	The task pins the user struct.

	The user struct pins its user namespace.

	The user namespace pins the struct user which created it.

	A user namespace now doesn't need to release its userids,
	because it is only released when its last user disappears.

This patch makes those changes.

Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 include/linux/init_task.h      |    1 -
 include/linux/key.h            |    3 ++
 include/linux/nsproxy.h        |    1 -
 include/linux/sched.h          |    1 +
 include/linux/user_namespace.h |   10 +++-----
 kernel/fork.c                  |    3 +-
 kernel/nsproxy.c               |   10 +-------
 kernel/sys.c                   |    4 +-
 kernel/user.c                  |   42 +++++++++------------------------------
 kernel/user_namespace.c        |   36 +++++++++------------------------
 security/keys/process_keys.c   |    7 +++++-
 11 files changed, 39 insertions(+), 79 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 021d8e7..550058b 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -57,7 +57,6 @@ extern struct nsproxy init_nsproxy;
 	.mnt_ns		= NULL,						\
 	INIT_NET_NS(net_ns)                                             \
 	INIT_IPC_NS(ipc_ns)						\
-	.user_ns	= &init_user_ns,				\
 }
 
 #define INIT_SIGHAND(sighand) {						\
diff --git a/include/linux/key.h b/include/linux/key.h
index c45c962..ba53aef 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -277,6 +277,8 @@ extern ctl_table key_sysctls[];
  * the userspace interface
  */
 extern void switch_uid_keyring(struct user_struct *new_user);
+extern void switch_uid_keyring_task(struct task_struct *tsk,
+				struct user_struct *new_user);
 extern int copy_keys(unsigned long clone_flags, struct task_struct *tsk);
 extern int copy_thread_group_keys(struct task_struct *tsk);
 extern void exit_keys(struct task_struct *tsk);
@@ -305,6 +307,7 @@ extern void key_init(void);
 #define key_ref_to_ptr(k)		({ NULL; })
 #define is_key_possessed(k)		0
 #define switch_uid_keyring(u)		do { } while(0)
+#define switch_uid_keyring_task(t,u)	do { } while(0)
 #define __install_session_keyring(t, k)	({ NULL; })
 #define copy_keys(f,t)			0
 #define copy_thread_group_keys(t)	0
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index c8a768e..afad7de 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -27,7 +27,6 @@ struct nsproxy {
 	struct ipc_namespace *ipc_ns;
 	struct mnt_namespace *mnt_ns;
 	struct pid_namespace *pid_ns;
-	struct user_namespace *user_ns;
 	struct net 	     *net_ns;
 };
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b0fe15a..a2f1356 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1715,6 +1715,7 @@ static inline struct user_struct *get_uid(struct user_struct *u)
 }
 extern void free_uid(struct user_struct *);
 extern void switch_uid(struct user_struct *);
+extern void task_switch_uid(struct task_struct *tsk, struct user_struct *);
 extern void release_uids(struct user_namespace *ns);
 
 #include <asm/current.h>
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index f9477c3..1b4959d 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -27,8 +27,7 @@ static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
 	return ns;
 }
 
-extern struct user_namespace *copy_user_ns(int flags,
-					   struct user_namespace *old_ns);
+extern int create_new_userns(int flags, struct task_struct *tsk);
 extern void free_user_ns(struct kref *kref);
 
 static inline void put_user_ns(struct user_namespace *ns)
@@ -44,13 +43,12 @@ static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
 	return &init_user_ns;
 }
 
-static inline struct user_namespace *copy_user_ns(int flags,
-						  struct user_namespace *old_ns)
+static inline int create_new_userns(int flags, struct task_struct *tsk)
 {
 	if (flags & CLONE_NEWUSER)
-		return ERR_PTR(-EINVAL);
+		return -EINVAL;
 
-	return old_ns;
+	return 0;
 }
 
 static inline void put_user_ns(struct user_namespace *ns)
diff --git a/kernel/fork.c b/kernel/fork.c
index 7ce2ebe..b5d4dc5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -941,8 +941,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	retval = -EAGAIN;
 	if (atomic_read(&p->user->processes) >=
 			p->signal->rlim[RLIMIT_NPROC].rlim_cur) {
-		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_RESOURCE) &&
-		    p->user != current->nsproxy->user_ns->root_user)
+		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_RESOURCE))
 			goto bad_fork_free;
 	}
 
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 21575fc..2edac05 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -81,11 +81,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 		goto out_pid;
 	}
 
-	new_nsp->user_ns = copy_user_ns(flags, tsk->nsproxy->user_ns);
-	if (IS_ERR(new_nsp->user_ns)) {
-		err = PTR_ERR(new_nsp->user_ns);
+	err = create_new_userns(flags, tsk);
+	if (err)
 		goto out_user;
-	}
 
 	new_nsp->net_ns = copy_net_ns(flags, tsk->nsproxy->net_ns);
 	if (IS_ERR(new_nsp->net_ns)) {
@@ -96,8 +94,6 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
 	return new_nsp;
 
 out_net:
-	if (new_nsp->user_ns)
-		put_user_ns(new_nsp->user_ns);
 out_user:
 	if (new_nsp->pid_ns)
 		put_pid_ns(new_nsp->pid_ns);
@@ -174,8 +170,6 @@ void free_nsproxy(struct nsproxy *ns)
 		put_ipc_ns(ns->ipc_ns);
 	if (ns->pid_ns)
 		put_pid_ns(ns->pid_ns);
-	if (ns->user_ns)
-		put_user_ns(ns->user_ns);
 	put_net(ns->net_ns);
 	kmem_cache_free(nsproxy_cachep, ns);
 }
diff --git a/kernel/sys.c b/kernel/sys.c
index c018580..4d8d415 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -553,13 +553,13 @@ static int set_user(uid_t new_ruid, int dumpclear)
 {
 	struct user_struct *new_user;
 
-	new_user = alloc_uid(current->nsproxy->user_ns, new_ruid);
+	new_user = alloc_uid(current->user->user_ns, new_ruid);
 	if (!new_user)
 		return -EAGAIN;
 
 	if (atomic_read(&new_user->processes) >=
 				current->signal->rlim[RLIMIT_NPROC].rlim_cur &&
-			new_user != current->nsproxy->user_ns->root_user) {
+			!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) {
 		free_uid(new_user);
 		return -EAGAIN;
 	}
diff --git a/kernel/user.c b/kernel/user.c
index aedb3a1..d3dd353 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -366,7 +366,7 @@ struct user_struct *find_user(uid_t uid)
 {
 	struct user_struct *ret;
 	unsigned long flags;
-	struct user_namespace *ns = current->nsproxy->user_ns;
+	struct user_namespace *ns = current->user->user_ns;
 
 	spin_lock_irqsave(&uidhash_lock, flags);
 	ret = uid_hash_find(uid, uidhashentry(ns, uid));
@@ -455,7 +455,7 @@ out_unlock:
 	return NULL;
 }
 
-void switch_uid(struct user_struct *new_user)
+void task_switch_uid(struct task_struct *tsk, struct user_struct *new_user)
 {
 	struct user_struct *old_user;
 
@@ -464,12 +464,12 @@ void switch_uid(struct user_struct *new_user)
 	 * cheaply with the new uid cache, so if it matters
 	 * we should be checking for it.  -DaveM
 	 */
-	old_user = current->user;
+	old_user = tsk->user;
 	atomic_inc(&new_user->processes);
 	atomic_dec(&old_user->processes);
-	switch_uid_keyring(new_user);
-	current->user = new_user;
-	sched_switch_user(current);
+	switch_uid_keyring_task(tsk, new_user);
+	tsk->user = new_user;
+	sched_switch_user(tsk);
 
 	/*
 	 * We need to synchronize with __sigqueue_alloc()
@@ -479,38 +479,16 @@ void switch_uid(struct user_struct *new_user)
 	 * structure.
 	 */
 	smp_mb();
-	spin_unlock_wait(&current->sighand->siglock);
+	spin_unlock_wait(&tsk->sighand->siglock);
 
 	free_uid(old_user);
-	suid_keys(current);
+	suid_keys(tsk);
 }
 
-#ifdef CONFIG_USER_NS
-void release_uids(struct user_namespace *ns)
+void switch_uid(struct user_struct *new_user)
 {
-	int i;
-	unsigned long flags;
-	struct hlist_head *head;
-	struct hlist_node *nd;
-
-	spin_lock_irqsave(&uidhash_lock, flags);
-	/*
-	 * collapse the chains so that the user_struct-s will
-	 * be still alive, but not in hashes. subsequent free_uid()
-	 * will free them.
-	 */
-	for (i = 0; i < UIDHASH_SZ; i++) {
-		head = ns->uidhash_table + i;
-		while (!hlist_empty(head)) {
-			nd = head->first;
-			hlist_del_init(nd);
-		}
-	}
-	spin_unlock_irqrestore(&uidhash_lock, flags);
-
-	free_uid(ns->root_user);
+	task_switch_uid(current, new_user);
 }
-#endif
 
 static int __init uid_cache_init(void)
 {
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index e8db443..0045dd0 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -16,14 +16,17 @@
  * @old_ns: namespace to clone
  * Return NULL on error (failure to kmalloc), new ns otherwise
  */
-static struct user_namespace *clone_user_ns(struct user_namespace *old_ns)
+int create_new_userns(int flags, struct task_struct *tsk)
 {
 	struct user_namespace *ns;
 	int n;
 
+	if (!(flags & CLONE_NEWUSER))
+		return 0;
+
 	ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL);
 	if (!ns)
-		return ERR_PTR(-ENOMEM);
+		return -ENOMEM;
 
 	kref_init(&ns->kref);
 
@@ -34,37 +37,19 @@ static struct user_namespace *clone_user_ns(struct user_namespace *old_ns)
 	ns->root_user = alloc_uid(ns, 0);
 	if (!ns->root_user) {
 		kfree(ns);
-		return ERR_PTR(-ENOMEM);
+		return -ENOMEM;
 	}
 
 	/* pin the creating user */
-	ns->creator = current->user;
+	ns->creator = tsk->user;
 	atomic_inc(&ns->creator->__count);
 
-	/*
-	 * The alloc_uid() incremented the userns refcount,
-	 * so drop it again 
-	 */
+	/* alloc_uid() incremented the userns refcount.  drop it again */
 	put_user_ns(ns);
 
-	switch_uid(ns->root_user);
-	return ns;
-}
-
-struct user_namespace * copy_user_ns(int flags, struct user_namespace *old_ns)
-{
-	struct user_namespace *new_ns;
-
-	BUG_ON(!old_ns);
-	get_user_ns(old_ns);
-
-	if (!(flags & CLONE_NEWUSER))
-		return old_ns;
-
-	new_ns = clone_user_ns(old_ns);
+	task_switch_uid(tsk, ns->root_user);
 
-	put_user_ns(old_ns);
-	return new_ns;
+	return 0;
 }
 
 void free_user_ns(struct kref *kref)
@@ -72,7 +57,6 @@ void free_user_ns(struct kref *kref)
 	struct user_namespace *ns;
 
 	ns = container_of(kref, struct user_namespace, kref);
-	release_uids(ns);
 	free_uid(ns->creator);
 	kfree(ns);
 }
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index 5be6d01..8d1cfb2 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -119,7 +119,8 @@ error:
 /*
  * deal with the UID changing
  */
-void switch_uid_keyring(struct user_struct *new_user)
+void switch_uid_keyring_task(struct task_struct *task,
+				struct user_struct *new_user)
 {
 #if 0 /* do nothing for now */
 	struct key *old;
@@ -142,6 +143,10 @@ void switch_uid_keyring(struct user_struct *new_user)
 
 } /* end switch_uid_keyring() */
 
+void switch_uid_keyring(struct user_struct *new_user)
+{
+	switch_uid_keyring_task(current, new_user);
+}
 /*****************************************************************************/
 /*
  * install a fresh thread keyring, discarding the old one
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 03/10] user namespaces: reset task's credentials on CLONE_NEWUSER
       [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-08-22 19:45   ` [PATCH 01/10] user namespaces: introduce user_struct->user_namespace relationship Serge E. Hallyn
  2008-08-22 19:45   ` [PATCH 02/10] user namespaces: move user_ns from nsproxy into user struct Serge E. Hallyn
@ 2008-08-22 19:45   ` Serge E. Hallyn
  2008-08-22 19:46   ` [PATCH 04/10] user namespaces: enforce user namespaces for file permission Serge E. Hallyn
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-22 19:45 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers

Currently, creating a new user namespace does not reset
the task's uid or gid.  Since generally that is done as
root because it requires CAP_SYS_ADMIN, and since the
first uid in the new namespace is 0, one usually doesn't
notice.  However, if one does

	capset cap_sys_admin=ep ns_exec
	su - hallyn
	  ns_exec -U /bin/sh
	  id

then one will see hallyn's userid, and all preexisting
groups.

With this patch, cloning a new user namespace will set
the task's uid and gid to 0, and reset the group_info to
the empty set assigned to init.

Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 kernel/user_namespace.c |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 0045dd0..39aea7b 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -11,6 +11,9 @@
 #include <linux/slab.h>
 #include <linux/user_namespace.h>
 
+/* defined in kernel/sys.c */
+extern struct group_info init_groups;
+
 /*
  * Clone a new ns copying an original user ns, setting refcount to 1
  * @old_ns: namespace to clone
@@ -48,6 +51,17 @@ int create_new_userns(int flags, struct task_struct *tsk)
 	put_user_ns(ns);
 
 	task_switch_uid(tsk, ns->root_user);
+	tsk->uid = tsk->euid = tsk->suid = tsk->fsuid = 0;
+	tsk->gid = tsk->egid = tsk->sgid = tsk->fsgid = 0;
+
+	/* this can't be safe for unshare, can it?  it's safe
+	 * for fork, though.  I'm tempted to limit clone_newuser to
+	 * fork only */
+	task_lock(tsk);
+	put_group_info(tsk->group_info);
+	tsk->group_info = &init_groups;
+	get_group_info(tsk->group_info);
+	task_unlock(tsk);
 
 	return 0;
 }
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 04/10] user namespaces: enforce user namespaces for file permission
       [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2008-08-22 19:45   ` [PATCH 03/10] user namespaces: reset task's credentials on CLONE_NEWUSER Serge E. Hallyn
@ 2008-08-22 19:46   ` Serge E. Hallyn
       [not found]     ` <20080822194609.GD10360-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-08-22 19:46   ` [PATCH 05/10] user namespaces: Allow registering new usernamespaces using mount Serge E. Hallyn
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-22 19:46 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers

Add a user_ns to the sb.  It is always set to the user_ns of the task which
mounted the sb.

Define 3 new super_operations.  convert_uid() and convert_gid() take a uid
or gid from an inode on the sb's fs, and attempt to convert them into ids
meaningful in the user namespace passed in, which presumably is current's.
is_capable() checks whether current has the requested capability with respect
to the sb's fs, which is dependent upon the relationship between current's
user_ns and those which the sb knows about.

If the sb isn't user ns aware - which none are right now - then the new
super_operations won't be defined.  If in that case current and sb have
different user namespaces, then the userids can't be compared.

If current's and sb's userids can't be compared, then current will get
'user other' (we usually say 'nobody') permissions to the inode.

Changelog:
	Aug 5: send the whole inode to the super_operations.

Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 fs/namei.c         |   29 +++++++++++++++++++++++---
 fs/super.c         |    3 ++
 include/linux/fs.h |   55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 4ea63ed..6bf5446 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -183,10 +183,27 @@ int generic_permission(struct inode *inode, int mask,
 		int (*check_acl)(struct inode *inode, int mask))
 {
 	umode_t			mode = inode->i_mode;
+	uid_t			iuid;
+	gid_t			igid;
+	int			ret;
+	int			good_userns = 1;
+
+	ret = s_convert_uid_gid(inode, current->user->user_ns,
+						&iuid, &igid);
+	if (!ret)
+		good_userns = 0;
 
 	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
 
-	if (current->fsuid == inode->i_uid)
+	/*
+	 * If we're not in the inode's user namespace, we get
+	 * user nobody permissions, and we ignore acls
+	 * (bc serge doesn't know how to handle acls in this case)
+	 */
+	if (!good_userns)
+		goto check;
+
+	if (current->fsuid == iuid)
 		mode >>= 6;
 	else {
 		if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) {
@@ -197,15 +214,18 @@ int generic_permission(struct inode *inode, int mask,
 				return error;
 		}
 
-		if (in_group_p(inode->i_gid))
+		if (in_group_p(igid))
 			mode >>= 3;
 	}
 
+check:
 	/*
 	 * If the DACs are ok we don't need any capability check.
 	 */
 	if ((mask & ~mode) == 0)
 		return 0;
+	if (!good_userns)
+		return -EACCES;
 
  check_capabilities:
 	/*
@@ -214,14 +234,15 @@ int generic_permission(struct inode *inode, int mask,
 	 */
 	if (!(mask & MAY_EXEC) ||
 	    (inode->i_mode & S_IXUGO) || S_ISDIR(inode->i_mode))
-		if (capable(CAP_DAC_OVERRIDE))
+		if (s_is_capable(inode, current->user->user_ns,
+						CAP_DAC_OVERRIDE))
 			return 0;
 
 	/*
 	 * Searching includes executable on directories, else just read.
 	 */
 	if (mask == MAY_READ || (S_ISDIR(inode->i_mode) && !(mask & MAY_WRITE)))
-		if (capable(CAP_DAC_READ_SEARCH))
+		if (s_is_capable(inode, current->user->user_ns, CAP_DAC_READ_SEARCH))
 			return 0;
 
 	return -EACCES;
diff --git a/fs/super.c b/fs/super.c
index e931ae9..3559637 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -38,6 +38,7 @@
 #include <linux/kobject.h>
 #include <linux/mutex.h>
 #include <linux/file.h>
+#include <linux/user_namespace.h>
 #include <asm/uaccess.h>
 #include "internal.h"
 
@@ -93,6 +94,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		s->s_qcop = sb_quotactl_ops;
 		s->s_op = &default_op;
 		s->s_time_gran = 1000000000;
+		s->user_ns = get_user_ns(current->user->user_ns);
 	}
 out:
 	return s;
@@ -109,6 +111,7 @@ static inline void destroy_super(struct super_block *s)
 	security_sb_free(s);
 	kfree(s->s_subtype);
 	kfree(s->s_options);
+	put_user_ns(s->user_ns);
 	kfree(s);
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 580b513..bb58c2e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1128,6 +1128,11 @@ struct super_block {
 	 * generic_show_options()
 	 */
 	char *s_options;
+
+	/*
+	 * namespace of the user which mounted the sb
+	 */
+	struct user_namespace *user_ns;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
@@ -1320,6 +1325,9 @@ struct super_operations {
 	int (*remount_fs) (struct super_block *, int *, char *);
 	void (*clear_inode) (struct inode *);
 	void (*umount_begin) (struct super_block *);
+	int (*is_capable) (struct user_namespace *, struct inode *, int);
+	uid_t (*convert_uid_gid)(struct user_namespace *, struct inode *,
+						uid_t *, gid_t *);
 
 	int (*show_options)(struct seq_file *, struct vfsmount *);
 	int (*show_stats)(struct seq_file *, struct vfsmount *);
@@ -1330,6 +1338,53 @@ struct super_operations {
 };
 
 /*
+ * In the next 3, i'm passing the user_ns in so that I don't need
+ * to know how to dereference struct user her (which would require
+ * #including sched.h).
+ * Note in particular that for instance in s_convert_uid, the uid is the
+ * inode->i_uid, while user_ns is current->user->user_ns.
+ */
+
+/*
+ * s_convert_uid: attempt to translate the inode's stored
+ * uid to a uid meaningful in user_ns passed in
+ * If possible, store the result in reuid and return 1
+ * Otherwise, return 0, meaningful the uid cannot be translated
+ */
+static inline int s_convert_uid_gid(struct inode *ino,
+		struct user_namespace *user_ns, uid_t *retuid, gid_t *retgid)
+{
+	struct super_block *sb = ino->i_sb;
+
+	if (sb->user_ns == user_ns) {
+		*retuid = ino->i_uid;
+		*retgid = ino->i_gid;
+		return 1;
+	}
+	if (sb->s_op->convert_uid_gid)
+		return sb->s_op->convert_uid_gid(user_ns, ino, retuid, retgid);
+	return 0;
+}
+
+/*
+ * s_is_capable: check whether current is capable with respect
+ * to the filesystem represented by sb.
+ *
+ * return 0 if false, 1 if true
+ */
+static inline int s_is_capable(struct inode *ino,
+		struct user_namespace *user_ns, int cap)
+{
+	struct super_block *sb = ino->i_sb;
+
+	if (sb->user_ns == user_ns)
+		return capable(cap);
+	if (sb->s_op->is_capable)
+		return sb->s_op->is_capable(user_ns, ino, cap);
+	return 0;
+}
+
+/*
  * Inode state bits.  Protected by inode_lock.
  *
  * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 05/10] user namespaces: Allow registering new usernamespaces using mount
       [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2008-08-22 19:46   ` [PATCH 04/10] user namespaces: enforce user namespaces for file permission Serge E. Hallyn
@ 2008-08-22 19:46   ` Serge E. Hallyn
  2008-08-22 19:46   ` [PATCH 06/10] user namespaces: hook fs/attr.c Serge E. Hallyn
                     ` (5 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-22 19:46 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers

Allow registering new user namespaces using mount(MS_ADD_USERNS).
Define lib/fsuserns.c which will contain functions which filesystems
can hook into to support user namespaces.

Since fsuserns.c currently supports neither reading policy nor
storing userns info using xattrs, the support is really bogus
for now.

The following program shows how to use the mount syscall to
register a new user namespace with a filesystem:

===================================================================
 #include <stdio.h>
 #include <sys/mount.h>
 #include <errno.h>

 #define MS_SET_USERNS   (1<<25) /* Add current's user_ns as valid for sb */

/*
 * path is a path to be remounted
 * userid is a string 'user=X' where X is an integer
 */

int main(int argc, char *argv[])
{
        char *path, *userid;
        int ret;

        path = argv[1];
        userid = argv[2];

        ret = mount(path, path, NULL, MS_SET_USERNS, userid);
        if (ret)
                perror("mount");
        return ret;
}
===================================================================

Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 fs/ext3/super.c    |   14 +++
 fs/namespace.c     |   11 +++
 include/linux/fs.h |    3 +
 lib/Makefile       |    2 +
 lib/fsuserns.c     |  226 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 256 insertions(+), 0 deletions(-)
 create mode 100644 lib/fsuserns.c

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index f38a5af..3458d25 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -719,6 +719,15 @@ static struct quotactl_ops ext3_qctl_operations = {
 };
 #endif
 
+#ifdef CONFIG_USER_NS
+extern int fsuserns_add_userns(struct super_block *sb,
+                struct user *user, void *data);
+extern int fsuserns_convert_uid_gid(struct user_namespace *ns,
+		struct inode *inode, uid_t *retuid, gid_t *retgid);
+extern int fsuserns_is_capable(struct user_namespace *ns,
+		struct inode *inode, int cap);
+#endif
+
 static const struct super_operations ext3_sops = {
 	.alloc_inode	= ext3_alloc_inode,
 	.destroy_inode	= ext3_destroy_inode,
@@ -738,6 +747,11 @@ static const struct super_operations ext3_sops = {
 	.quota_read	= ext3_quota_read,
 	.quota_write	= ext3_quota_write,
 #endif
+#ifdef CONFIG_USER_NS
+	.add_userns	= fsuserns_add_userns,
+	.is_capable	= fsuserns_is_capable,
+	.convert_uid_gid = fsuserns_convert_uid_gid,
+#endif
 };
 
 static const struct export_operations ext3_export_ops = {
diff --git a/fs/namespace.c b/fs/namespace.c
index 6e283c9..4bb8c61 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1885,6 +1885,15 @@ int copy_mount_options(const void __user * data, unsigned long *where)
 	return 0;
 }
 
+int do_add_userns(struct nameidata *nd, struct user_struct *user,
+		void *data_page)
+{
+	struct super_block *sb = nd->path.mnt->mnt_sb;
+	if (sb->s_op->add_userns)
+		return sb->s_op->add_userns(sb, user, data_page);
+	return -EINVAL;
+}
+
 /*
  * Flags is a 32-bit value that allows up to 31 non-fs dependent flags to
  * be given to the mount() call (ie: read-only, no-dev, no-suid etc).
@@ -1958,6 +1967,8 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 		retval = do_change_type(&nd, flags);
 	else if (flags & MS_MOVE)
 		retval = do_move_mount(&nd, dev_name);
+	else if (flags & MS_SET_USERNS)
+		retval = do_add_userns(&nd, current->user, data_page);
 	else
 		retval = do_new_mount(&nd, type_page, flags, mnt_flags,
 				      dev_name, data_page);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bb58c2e..492abef 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -128,6 +128,7 @@ extern int dir_notify_enable;
 #define MS_RELATIME	(1<<21)	/* Update atime relative to mtime/ctime. */
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
+#define MS_SET_USERNS	(1<<25) /* Add current's user_ns as valid for sb */
 #define MS_ACTIVE	(1<<30)
 #define MS_NOUSER	(1<<31)
 
@@ -1308,6 +1309,7 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 
+struct user_struct;
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
 	void (*destroy_inode)(struct inode *);
@@ -1325,6 +1327,7 @@ struct super_operations {
 	int (*remount_fs) (struct super_block *, int *, char *);
 	void (*clear_inode) (struct inode *);
 	void (*umount_begin) (struct super_block *);
+	int (*add_userns) (struct super_block *, struct user_struct *, void *);
 	int (*is_capable) (struct user_namespace *, struct inode *, int);
 	uid_t (*convert_uid_gid)(struct user_namespace *, struct inode *,
 						uid_t *, gid_t *);
diff --git a/lib/Makefile b/lib/Makefile
index 3b1f94b..4f80936 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -80,6 +80,8 @@ obj-$(CONFIG_HAVE_LMB) += lmb.o
 
 obj-$(CONFIG_HAVE_ARCH_TRACEHOOK) += syscall.o
 
+obj-$(CONFIG_USER_NS) += fsuserns.o
+
 hostprogs-y	:= gen_crc32table
 clean-files	:= crc32table.h
 
diff --git a/lib/fsuserns.c b/lib/fsuserns.c
new file mode 100644
index 0000000..0a9f52d
--- /dev/null
+++ b/lib/fsuserns.c
@@ -0,0 +1,226 @@
+#include <linux/err.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/user.h>
+#include <linux/user_namespace.h>
+
+/*
+ * Ok, eventually I'll want some policy loaded which looks as follows:
+ *
+ * [domains]
+ * INIT 1
+ * serge 2
+ * vs2 3
+ *
+ * [owners]
+ * serge serge.INIT
+ * vs2 root.INIT
+ *
+ * Meaning root on the host may own domain vs2, which is identified
+ * by id '3'.  user serge on the host may own domain serge, which is
+ * identified by id '2'.  In this case, when userid 40 in domain 'serge'
+ * saves a file, the file will be stored with serge's uid, and an
+ * xattr listing <2,40> meaning the file is owned by userid 40 in domain 2.
+ *
+ * BUT I don't wnat to deal with policy parsing yet, so for now I will
+ * just provide the interface for an fs to use these helpers.
+ */
+
+/*
+ * 1. When a user passes the MS_NEWUSER flag to mount(2), do_loopback()
+ * will call super_add_userns().  The fs can use fsuserns_add_userns()
+ * which will check whether current->fsuid is allowed to own the id
+ * passed in with 'userns=X' data.
+ * If so, we log the fact that X=get_user_ns(current->user_ns)
+ * and henceforth a task which is in that userns will be allowed to
+ * use the fs.
+ *
+ * 2. The fs will point its super_ops->convert_uid_gid() and
+ * super_ops->is_capable()  to fsuserns_convert_uid_gid() and
+ * fsuserns_is_capable().
+ */
+
+DEFINE_MUTEX(fsuserns_table_mutex);
+
+struct fsuserns_table_entries {
+	struct user_namespace *ns;
+	unsigned int userns_id;
+	struct list_head entries;
+};
+
+struct fsuserns_conversion_table {
+	struct super_block *sb;
+	struct list_head entries;
+	struct list_head tables;
+};
+
+LIST_HEAD(fsuserns_tables);
+
+struct fsuserns_conversion_table *find_table_locked(
+				struct super_block *sb)
+{
+	struct fsuserns_conversion_table *p;
+
+	if (list_empty(&fsuserns_tables))
+		return NULL;
+	list_for_each_entry(p, &fsuserns_tables, tables) {
+		if (p->sb == sb)
+			return p;
+	}
+
+	return NULL;
+}
+
+struct fsuserns_conversion_table *find_table(struct super_block *sb)
+{
+	struct fsuserns_conversion_table *p;
+
+	mutex_lock(&fsuserns_table_mutex);
+	p = find_table_locked(sb);
+	mutex_unlock(&fsuserns_table_mutex);
+
+	return p;
+}
+
+struct fsuserns_conversion_table *find_or_create_table(struct super_block *sb)
+{
+	struct fsuserns_conversion_table *p, *new;
+
+	p = find_table(sb);
+	if (p)
+		return p;
+
+	new = kmalloc(sizeof(*new), GFP_KERNEL);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_lock(&fsuserns_table_mutex);
+	p = find_table_locked(sb);
+	if (p)
+		goto out;
+	new->sb = sb;
+	INIT_LIST_HEAD(&new->entries);
+	list_add_tail(&new->tables, &fsuserns_tables);
+	p = new;
+
+out:
+	mutex_unlock(&fsuserns_table_mutex);
+	if (p != new)
+		kfree(new);
+	return p;
+}
+
+int fsuserns_add_userns(struct super_block *sb,
+		struct user_struct *user, void *data)
+{
+	struct user_namespace *user_ns = user->user_ns;
+	struct fsuserns_table_entries *e, *ep;
+	struct fsuserns_conversion_table *t;
+	int ret, newid;
+
+	t = find_or_create_table(sb);
+	if (!t)
+		return -ENOMEM;
+
+	/*
+	 * The creator of current's user namespace must be listed as
+	 * owning 'newid'.  But we don't have such policy yet.  So
+	 * for now, for the POC, we just check whether the creator's
+	 * userns is the sb->user_ns.
+	 * It's a bogus check!  Remove as soon as we have policy.
+	 *
+	 * Note, since unprivileged mounts are not yet allowed, we
+	 * know the owner had CAP_SYS_ADMIN to get here, so we don't
+	 * even bother checking the creator's uid.
+	 */
+	if (user_ns->creator->user_ns != sb->user_ns)
+		return -EPERM;
+	
+	ret = sscanf(data, "nsid=%d", &newid);
+	if (ret != 1)
+		return -EINVAL;
+
+	e = kmalloc(sizeof(*e), GFP_KERNEL);
+	if (!e)
+		return -ENOMEM;
+	mutex_lock(&fsuserns_table_mutex);
+	list_for_each_entry(ep, &t->entries, entries) {
+		if (ep->ns == user_ns && ep->userns_id == newid) {
+			kfree(ep);
+			goto skip;
+		}
+	}
+	e->ns = get_user_ns(user_ns);
+	e->userns_id = newid;
+	list_add_tail(&e->entries, &t->entries);
+skip:
+	mutex_unlock(&fsuserns_table_mutex);
+
+	return 0;
+}
+
+/*
+ * return 1 if we got a translation, 0 otherwise
+ */
+int fsuserns_convert_uid_gid(struct user_namespace *ns, struct inode *inode,
+                                                uid_t *retuid, gid_t *retgid)
+{
+	struct super_block *sb = inode->i_sb;
+	struct fsuserns_conversion_table *t;
+	struct fsuserns_table_entries *ep;
+
+	t = find_table(sb);
+	if (!t)
+		return 0;
+	mutex_lock(&fsuserns_table_mutex);
+	list_for_each_entry(ep, &t->entries, entries) {
+		if (ep->ns == ns)
+			goto convert;
+	}
+	mutex_unlock(&fsuserns_table_mutex);
+
+	return 0;
+convert:
+	mutex_unlock(&fsuserns_table_mutex);
+
+	/*
+	 * ok now we would look through the xattrs for the
+	 * inode to find a stored uid in this namespace.
+	 * But we don't do that yet, so we just claim failure :)
+	 * Don't worry, the is_capable() function can be more
+	 * meaningful.
+	 */
+	return 0;
+}
+
+int fsuserns_is_capable(struct user_namespace *ns, struct inode *inode,
+						int cap)
+{
+	struct super_block *sb = inode->i_sb;
+	struct fsuserns_conversion_table *t;
+	struct fsuserns_table_entries *ep;
+
+	t = find_table(sb);
+	if (!t)
+		return 0;
+	printk(KERN_NOTICE "%s: found a table\n", __func__);
+
+	mutex_lock(&fsuserns_table_mutex);
+	list_for_each_entry(ep, &t->entries, entries) {
+		if (ep->ns == ns)
+			goto convert;
+	}
+	mutex_unlock(&fsuserns_table_mutex);
+
+	return 0;
+convert:
+	mutex_unlock(&fsuserns_table_mutex);
+
+	printk(KERN_NOTICE "%s: found an entry\n", __func__);
+	if (capable(cap))
+		return 1;
+	printk(KERN_NOTICE "%s: oh, but I wasn't capable(%d)\n", __func__, cap);
+	return 0;
+}
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 06/10] user namespaces: hook fs/attr.c
       [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2008-08-22 19:46   ` [PATCH 05/10] user namespaces: Allow registering new usernamespaces using mount Serge E. Hallyn
@ 2008-08-22 19:46   ` Serge E. Hallyn
  2008-08-22 19:46   ` [PATCH 07/10] user namespaces: bad bad bad but test code Serge E. Hallyn
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-22 19:46 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers

Hook fs/attr.c so things like chown are properly handled.  Note this is only
for permission checks.  We'll need to hook ext3_setattr to get the right
uids updated.

Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 fs/attr.c             |   28 +++++++++++++++++++---------
 include/linux/sched.h |    1 +
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 26c71ba..072d367 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -22,6 +22,11 @@ int inode_change_ok(struct inode *inode, struct iattr *attr)
 {
 	int retval = -EPERM;
 	unsigned int ia_valid = attr->ia_valid;
+	uid_t iuid;
+	gid_t igid;
+
+	if (!s_convert_uid_gid(inode, current_userns(), &iuid, &igid))
+		return -EPERM;
 
 	/* If force is set do it anyway. */
 	if (ia_valid & ATTR_FORCE)
@@ -29,30 +34,30 @@ int inode_change_ok(struct inode *inode, struct iattr *attr)
 
 	/* Make sure a caller can chown. */
 	if ((ia_valid & ATTR_UID) &&
-	    (current->fsuid != inode->i_uid ||
-	     attr->ia_uid != inode->i_uid) && !capable(CAP_CHOWN))
+	    (current->fsuid != iuid ||
+	     attr->ia_uid != iuid) && !s_is_capable(inode, current_userns(), CAP_CHOWN))
 		goto error;
 
 	/* Make sure caller can chgrp. */
 	if ((ia_valid & ATTR_GID) &&
-	    (current->fsuid != inode->i_uid ||
-	    (!in_group_p(attr->ia_gid) && attr->ia_gid != inode->i_gid)) &&
-	    !capable(CAP_CHOWN))
+	    (current->fsuid != iuid ||
+	    (!in_group_p(attr->ia_gid) && attr->ia_gid != igid)) &&
+	    !s_is_capable(inode, current_userns(), CAP_CHOWN))
 		goto error;
 
 	/* Make sure a caller can chmod. */
 	if (ia_valid & ATTR_MODE) {
-		if (!is_owner_or_cap(inode))
+		if (current->fsuid != iuid || !s_is_capable(inode, current_userns(), CAP_FOWNER))
 			goto error;
 		/* Also check the setgid bit! */
 		if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
-				inode->i_gid) && !capable(CAP_FSETID))
+				igid) && !s_is_capable(inode, current_userns(), CAP_FSETID))
 			attr->ia_mode &= ~S_ISGID;
 	}
 
 	/* Check for setting the inode time. */
 	if (ia_valid & (ATTR_MTIME_SET | ATTR_ATIME_SET | ATTR_TIMES_SET)) {
-		if (!is_owner_or_cap(inode))
+		if (current->fsuid != iuid || !s_is_capable(inode, current_userns(), CAP_FOWNER))
 			goto error;
 	}
 fine:
@@ -66,6 +71,11 @@ EXPORT_SYMBOL(inode_change_ok);
 int inode_setattr(struct inode * inode, struct iattr * attr)
 {
 	unsigned int ia_valid = attr->ia_valid;
+	uid_t iuid;
+	gid_t igid;
+
+	if (!s_convert_uid_gid(inode, current_userns(), &iuid, &igid))
+		return -EPERM;
 
 	if (ia_valid & ATTR_SIZE &&
 	    attr->ia_size != i_size_read(inode)) {
@@ -90,7 +100,7 @@ int inode_setattr(struct inode * inode, struct iattr * attr)
 	if (ia_valid & ATTR_MODE) {
 		umode_t mode = attr->ia_mode;
 
-		if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
+		if (!in_group_p(igid) && !s_is_capable(inode, current_userns(), CAP_FSETID))
 			mode &= ~S_ISGID;
 		inode->i_mode = mode;
 	}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a2f1356..f557535 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -611,6 +611,7 @@ struct user_struct {
 #endif
 };
 
+#define current_userns() (current->user->user_ns)
 extern int uids_sysfs_init(void);
 
 extern struct user_struct *find_user(uid_t);
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 07/10] user namespaces: bad bad bad but test code
       [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2008-08-22 19:46   ` [PATCH 06/10] user namespaces: hook fs/attr.c Serge E. Hallyn
@ 2008-08-22 19:46   ` Serge E. Hallyn
  2008-08-22 19:47   ` [PATCH 08/10] userns: store child userns uids as xattrs in ext3 using lib/fsuserns Serge E. Hallyn
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-22 19:46 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers

Let uid 0 in a child namespace whose creator owns a file,
access that file.

This of course means that user hallyn (if he is allowed to
remount / for his userns, i.e. through
	capset cap_sys_admin=ep usernsremount
can create files owned by root.

So this is only so we can play.  This code will be removed
in favor of code doing "the right thing" using extended
attributes.  Then, when the above user creates a file,
the inode->iuid will be set to 500 (hallyn), and an
xattr named fs.userns=(<nsid>,0) will store the fact that
in the given nsid (might be 1 for instance) uid 0 owns
the file.

Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 lib/fsuserns.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/lib/fsuserns.c b/lib/fsuserns.c
index 0a9f52d..c237d1d 100644
--- a/lib/fsuserns.c
+++ b/lib/fsuserns.c
@@ -185,6 +185,15 @@ int fsuserns_convert_uid_gid(struct user_namespace *ns, struct inode *inode,
 convert:
 	mutex_unlock(&fsuserns_table_mutex);
 
+	/* The following is BAD CODE.  IT's for testing only */
+	if (current->uid == 0) {
+		if (inode->i_uid == ns->creator->uid) {
+			*retuid = 0;
+			*retgid = 0;
+			return 1;
+		}
+	}
+
 	/*
 	 * ok now we would look through the xattrs for the
 	 * inode to find a stored uid in this namespace.
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 08/10] userns: store child userns uids as xattrs in ext3 using lib/fsuserns
       [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2008-08-22 19:46   ` [PATCH 07/10] user namespaces: bad bad bad but test code Serge E. Hallyn
@ 2008-08-22 19:47   ` Serge E. Hallyn
  2008-08-22 19:47   ` [PATCH 09/10] userns: have ext3 use fsuserns to read userns xattrs, and add groups to userns Serge E. Hallyn
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-22 19:47 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers

userns: store child userns uids as xattrs in ext3 using lib/fsuserns

Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 fs/ext3/namei.c                |    7 +++-
 fs/ext3/xattr.c                |   29 +++++++++++++++
 fs/ext3/xattr.h                |    2 +
 include/linux/user_namespace.h |    1 +
 kernel/user_namespace.c        |    1 +
 lib/fsuserns.c                 |   74 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 113 insertions(+), 1 deletions(-)

diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index de13e91..e5be4bc 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1676,6 +1676,9 @@ static int ext3_add_nondir(handle_t *handle,
 	return err;
 }
 
+int fsuserns_store_creds(struct inode *inode, struct user_struct *user,
+	int (*)(struct inode *inode, const void *value, size_t value_len));
+
 /*
  * By the time this is called, we already have created
  * the directory cache entry for the new file, but it
@@ -1707,7 +1710,9 @@ retry:
 		inode->i_op = &ext3_file_inode_operations;
 		inode->i_fop = &ext3_file_operations;
 		ext3_set_aops(inode);
-		err = ext3_add_nondir(handle, dentry, inode);
+		err = fsuserns_store_creds(inode, current->user, ext3_xattr_set_userns);
+		if (!err)
+			err = ext3_add_nondir(handle, dentry, inode);
 	}
 	ext3_journal_stop(handle);
 	if (err == -ENOSPC && ext3_should_retry_alloc(dir->i_sb, &retries))
diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c
index 175414a..da47c35 100644
--- a/fs/ext3/xattr.c
+++ b/fs/ext3/xattr.c
@@ -1070,6 +1070,35 @@ retry:
 	return error;
 }
 
+int
+ext3_xattr_set_userns(struct inode *inode,
+	       const void *value, size_t value_len)
+{
+	int name_index = EXT3_XATTR_INDEX_SECURITY;
+	handle_t *handle;
+	int error, retries = 0;
+	char *name = "userns";
+
+retry:
+	handle = ext3_journal_start(inode, EXT3_DATA_TRANS_BLOCKS(inode->i_sb));
+	if (IS_ERR(handle)) {
+		error = PTR_ERR(handle);
+	} else {
+		int error2;
+
+		error = ext3_xattr_set_handle(handle, inode, name_index, name,
+					      value, value_len, flags);
+		error2 = ext3_journal_stop(handle);
+		if (error == -ENOSPC &&
+		    ext3_should_retry_alloc(inode->i_sb, &retries))
+			goto retry;
+		if (error == 0)
+			error = error2;
+	}
+
+	return error;
+}
+
 /*
  * ext3_xattr_delete_inode()
  *
diff --git a/fs/ext3/xattr.h b/fs/ext3/xattr.h
index 148a4df..8a523de 100644
--- a/fs/ext3/xattr.h
+++ b/fs/ext3/xattr.h
@@ -70,6 +70,8 @@ extern int ext3_xattr_get(struct inode *, int, const char *, void *, size_t);
 extern int ext3_xattr_set(struct inode *, int, const char *, const void *, size_t, int);
 extern int ext3_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int);
 
+extern int ext3_xattr_set_userns(struct inode *inode, const void *value, size_t value_len, int flags);
+
 extern void ext3_xattr_delete_inode(handle_t *, struct inode *);
 extern void ext3_xattr_put_super(struct super_block *);
 
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 1b4959d..a793263 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -14,6 +14,7 @@ struct user_namespace {
 	struct hlist_head	uidhash_table[UIDHASH_SZ];
 	struct user_struct	*root_user;
 	struct user_struct	*creator;
+	gid_t			creator_grp;
 };
 
 extern struct user_namespace init_user_ns;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 39aea7b..879693a 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -51,6 +51,7 @@ int create_new_userns(int flags, struct task_struct *tsk)
 	put_user_ns(ns);
 
 	task_switch_uid(tsk, ns->root_user);
+	ns->creator_grp = tsk->gid;
 	tsk->uid = tsk->euid = tsk->suid = tsk->fsuid = 0;
 	tsk->gid = tsk->egid = tsk->sgid = tsk->fsgid = 0;
 
diff --git a/lib/fsuserns.c b/lib/fsuserns.c
index c237d1d..db70970 100644
--- a/lib/fsuserns.c
+++ b/lib/fsuserns.c
@@ -5,6 +5,7 @@
 #include <linux/fs.h>
 #include <linux/user.h>
 #include <linux/user_namespace.h>
+#include <linux/xattr.h>
 
 /*
  * Ok, eventually I'll want some policy loaded which looks as follows:
@@ -233,3 +234,76 @@ convert:
 	printk(KERN_NOTICE "%s: oh, but I wasn't capable(%d)\n", __func__, cap);
 	return 0;
 }
+
+/*
+ * when a user_namespace is registered with an fs, we store the
+ * nsid.  This next fn will need to retreive an nsid for a
+ * given fs (inode->i_sb, that is) and user_namespace
+ *
+ * I don't want to do that bookkeeping yet, so i just return 1 :)
+ */
+int find_ns_id(struct inode *inode, struct user_namespace *ns)
+{
+	struct fsuserns_conversion_table *t;
+	struct fsuserns_table_entries *ep;
+
+	t = find_table(inode->i_sb);
+	mutex_lock(&fsuserns_table_mutex);
+	list_for_each_entry(ep, &t->entries, entries) {
+		if (ep->ns == ns)
+			goto found;
+	}
+	ep = NULL;
+found:
+	mutex_unlock(&fsuserns_table_mutex);
+
+	if (!ep)
+		return -1;
+	return ep->userns_id;
+
+
+}
+
+struct unsstore {
+	int ns;
+	uid_t uid;
+};
+
+int fsuserns_store_creds(struct inode *inode, struct user_struct *user,
+	int (*xattrset)(struct inode *inode, const void *value, size_t value_len))
+{
+	struct user_namespace *ns = user->user_ns, *lastns;
+	size_t size;
+	int i, depth, ret;
+	struct unsstore *unsstore;
+	struct user_struct *creator = user;
+	
+	if (ns == &init_user_ns)
+		return 0;
+
+	depth = 0;
+	while (ns != &init_user_ns) {
+		depth++;
+		creator = ns->creator;
+		ns = creator->user_ns;
+	}
+	size = depth * sizeof(struct unsstore);
+	unsstore = kmalloc(size, GFP_KERNEL);
+	ns = user->user_ns;
+	for (i=0; ns != &init_user_ns; i++) {
+		unsstore[i].ns = find_ns_id(inode, ns);
+		unsstore[i].uid = user->uid;
+		printk(KERN_NOTICE "%s: setting xattr with ns=%d,uid=%d\n", __func__,
+				unsstore[i].ns, unsstore[i].uid);
+		user = ns->creator;
+		lastns = ns;
+		ns = user->user_ns;
+	}
+
+	inode->i_uid = creator->uid;
+	inode->i_gid = lastns->creator_grp;
+
+	ret = xattrset(inode, unsstore, size);
+	kfree(unsstore);
+	return ret;
+}
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 09/10] userns: have ext3 use fsuserns to read userns xattrs, and add groups to userns
       [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2008-08-22 19:47   ` [PATCH 08/10] userns: store child userns uids as xattrs in ext3 using lib/fsuserns Serge E. Hallyn
@ 2008-08-22 19:47   ` Serge E. Hallyn
  2008-08-22 19:47   ` [PATCH 10/10] userns: add support for readdir Serge E. Hallyn
  2008-08-22 20:41   ` [0/10] User namespaces: introduction Eric W. Biederman
  10 siblings, 0 replies; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-22 19:47 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers

userns: have ext3 use fsuserns to read userns xattrs, and add groups to userns

Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 fs/ext3/super.c    |   11 +++++++++--
 fs/ext3/xattr.c    |   19 ++++++++++++++++++-
 fs/ext3/xattr.h    |    3 ++-
 include/linux/fs.h |    2 +-
 lib/fsuserns.c     |   48 +++++++++++++++++++++++++++++++++++++++++-------
 5 files changed, 71 insertions(+), 12 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 3458d25..37c8404 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -723,11 +723,18 @@ static struct quotactl_ops ext3_qctl_operations = {
 extern int fsuserns_add_userns(struct super_block *sb,
                 struct user *user, void *data);
 extern int fsuserns_convert_uid_gid(struct user_namespace *ns,
-		struct inode *inode, uid_t *retuid, gid_t *retgid);
+		struct inode *inode, uid_t *retuid, gid_t *retgid,
+		int (*xattrget)(struct inode *, const void *, size_t));
 extern int fsuserns_is_capable(struct user_namespace *ns,
 		struct inode *inode, int cap);
 #endif
 
+int ext3_convert_uid_gid(struct user_namespace *ns, struct inode *inode,
+                        uid_t *retuid, gid_t *retgid)
+{
+	return fsuserns_convert_uid_gid(ns, inode, retuid, retgid, ext3_xattr_get_userns);
+}
+
 static const struct super_operations ext3_sops = {
 	.alloc_inode	= ext3_alloc_inode,
 	.destroy_inode	= ext3_destroy_inode,
@@ -750,7 +757,7 @@ static const struct super_operations ext3_sops = {
 #ifdef CONFIG_USER_NS
 	.add_userns	= fsuserns_add_userns,
 	.is_capable	= fsuserns_is_capable,
-	.convert_uid_gid = fsuserns_convert_uid_gid,
+	.convert_uid_gid = ext3_convert_uid_gid,
 #endif
 };
 
diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c
index da47c35..500fec7 100644
--- a/fs/ext3/xattr.c
+++ b/fs/ext3/xattr.c
@@ -331,6 +331,23 @@ ext3_xattr_get(struct inode *inode, int name_index, const char *name,
 	return error;
 }
 
+int
+ext3_xattr_get_userns(struct inode *inode, void *value, size_t value_size)
+{
+	int error;
+	int name_index = EXT3_XATTR_INDEX_SECURITY;
+	char *name = "userns";
+
+	down_read(&EXT3_I(inode)->xattr_sem);
+	error = ext3_xattr_ibody_get(inode, name_index, name, value,
+				     value_size);
+	if (error == -ENODATA)
+		error = ext3_xattr_block_get(inode, name_index, name, value,
+					     value_size);
+	up_read(&EXT3_I(inode)->xattr_sem);
+	return error;
+}
+
 static int
 ext3_xattr_list_entries(struct inode *inode, struct ext3_xattr_entry *entry,
 			char *buffer, size_t buffer_size)
@@ -1087,7 +1104,7 @@ retry:
 		int error2;
 
 		error = ext3_xattr_set_handle(handle, inode, name_index, name,
-					      value, value_len, flags);
+					      value, value_len, 0);
 		error2 = ext3_journal_stop(handle);
 		if (error == -ENOSPC &&
 		    ext3_should_retry_alloc(inode->i_sb, &retries))
diff --git a/fs/ext3/xattr.h b/fs/ext3/xattr.h
index 8a523de..8c5b982 100644
--- a/fs/ext3/xattr.h
+++ b/fs/ext3/xattr.h
@@ -70,7 +70,8 @@ extern int ext3_xattr_get(struct inode *, int, const char *, void *, size_t);
 extern int ext3_xattr_set(struct inode *, int, const char *, const void *, size_t, int);
 extern int ext3_xattr_set_handle(handle_t *, struct inode *, int, const char *, const void *, size_t, int);
 
-extern int ext3_xattr_set_userns(struct inode *inode, const void *value, size_t value_len, int flags);
+extern int ext3_xattr_get_userns(struct inode *inode, void *value, size_t value_len);
+extern int ext3_xattr_set_userns(struct inode *inode, const void *value, size_t value_len);
 
 extern void ext3_xattr_delete_inode(handle_t *, struct inode *);
 extern void ext3_xattr_put_super(struct super_block *);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 492abef..9ec6dac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1329,7 +1329,7 @@ struct super_operations {
 	void (*umount_begin) (struct super_block *);
 	int (*add_userns) (struct super_block *, struct user_struct *, void *);
 	int (*is_capable) (struct user_namespace *, struct inode *, int);
-	uid_t (*convert_uid_gid)(struct user_namespace *, struct inode *,
+	int (*convert_uid_gid)(struct user_namespace *, struct inode *,
 						uid_t *, gid_t *);
 
 	int (*show_options)(struct seq_file *, struct vfsmount *);
diff --git a/lib/fsuserns.c b/lib/fsuserns.c
index db70970..ac0ca99 100644
--- a/lib/fsuserns.c
+++ b/lib/fsuserns.c
@@ -59,6 +59,12 @@ struct fsuserns_conversion_table {
 
 LIST_HEAD(fsuserns_tables);
 
+struct unsstore {
+	int ns;
+	uid_t uid;
+	gid_t gid;
+};
+
 struct fsuserns_conversion_table *find_table_locked(
 				struct super_block *sb)
 {
@@ -166,11 +172,15 @@ skip:
  * return 1 if we got a translation, 0 otherwise
  */
 int fsuserns_convert_uid_gid(struct user_namespace *ns, struct inode *inode,
-                                                uid_t *retuid, gid_t *retgid)
+			uid_t *retuid, gid_t *retgid,
+			int (*xattrget)(struct inode *inode, void *value, size_t value_len))
 {
 	struct super_block *sb = inode->i_sb;
 	struct fsuserns_conversion_table *t;
 	struct fsuserns_table_entries *ep;
+	size_t valuelen;
+	struct unsstore *unsstore;
+	int i, ret;
 
 	t = find_table(sb);
 	if (!t)
@@ -186,6 +196,23 @@ int fsuserns_convert_uid_gid(struct user_namespace *ns, struct inode *inode,
 convert:
 	mutex_unlock(&fsuserns_table_mutex);
 
+	/* look for an xattr */
+	/* yes, 3 needs to be made adjustable */
+	valuelen = 3;
+	unsstore = kzalloc(3*sizeof(struct unsstore), GFP_KERNEL);
+	ret = xattrget(inode, unsstore, valuelen);
+	if (ret < 0)
+		return ret;
+	for (i=0; i<3; i++)
+		if (unsstore[i].ns == ep->userns_id)
+			break;
+	if (i==3)
+		goto out;
+	*retuid = unsstore[i].uid;
+	*retgid = unsstore[i].gid;
+	return 1;
+
+out:
 	/* The following is BAD CODE.  IT's for testing only */
 	if (current->uid == 0) {
 		if (inode->i_uid == ns->creator->uid) {
@@ -264,11 +291,6 @@ found:
 
 }
 
-struct unsstore {
-	int ns;
-	uid_t uid;
-};
-
 int fsuserns_store_creds(struct inode *inode, struct user_struct *user,
 	int (*xattrset)(struct inode *inode, const void *value, size_t value_len))
 {
@@ -290,11 +312,23 @@ int fsuserns_store_creds(struct inode *inode, struct user_struct *user,
 	size = depth * sizeof(struct unsstore);
 	unsstore = kmalloc(size, GFP_KERNEL);
 	ns = user->user_ns;
+	lastns = NULL;
 	for (i=0; ns != &init_user_ns; i++) {
 		unsstore[i].ns = find_ns_id(inode, ns);
 		unsstore[i].uid = user->uid;
+		/*
+		 * this is too bad.  But putting grp on user_struct wouldn't work
+		 * (think about if i do
+		* 		clone(CLONE_NEWUSER);
+		 * 		change grp,
+		 * 		clone(CLONE_NEWUSER)
+		 * */
+		if (!lastns)  
+			unsstore[i].gid = current->gid;
+		else
+			unsstore[i].gid = lastns->creator_grp;
 		printk(KERN_NOTICE "%s: setting xattr with ns=%d,uid=%d\n", __func__,
-				unsstore[i].ns, unsstore[i].uid);
+				unsstore[i].ns, unsstore[i].uid, unsstore[i].gid);
 		user = ns->creator;
 		lastns = ns;
 		ns = user->user_ns;
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 10/10] userns: add support for readdir
       [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2008-08-22 19:47   ` [PATCH 09/10] userns: have ext3 use fsuserns to read userns xattrs, and add groups to userns Serge E. Hallyn
@ 2008-08-22 19:47   ` Serge E. Hallyn
  2008-08-22 20:41   ` [0/10] User namespaces: introduction Eric W. Biederman
  10 siblings, 0 replies; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-22 19:47 UTC (permalink / raw)
  To: Eric W. Biederman, Linux Containers

Now ls works correctly inside a userns!

(but don't go doing some sort of setattr like 'chown'  :)

Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 fs/ext3/file.c  |    4 ++++
 fs/ext3/inode.c |   22 ++++++++++++++++++++++
 fs/ext3/namei.c |    3 +++
 fs/ext3/xattr.c |    6 ++++++
 lib/fsuserns.c  |   42 +++++++++++++++++++++++++++++-------------
 5 files changed, 64 insertions(+), 13 deletions(-)

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index acc4913..b259061 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -106,6 +106,9 @@ force_commit:
 	return ret;
 }
 
+extern int ext3_getattr(struct vfsmount *mnt, struct dentry *dentry,
+					struct kstat *stat);
+
 const struct file_operations ext3_file_operations = {
 	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
@@ -134,5 +137,6 @@ const struct inode_operations ext3_file_inode_operations = {
 	.removexattr	= generic_removexattr,
 #endif
 	.permission	= ext3_permission,
+	.getattr	= ext3_getattr,
 };
 
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 507d868..b252490 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -36,6 +36,7 @@
 #include <linux/mpage.h>
 #include <linux/uio.h>
 #include <linux/bio.h>
+#include <linux/security.h>
 #include "xattr.h"
 #include "acl.h"
 
@@ -3088,6 +3089,27 @@ err_out:
 	return error;
 }
 
+int ext3_getattr(struct vfsmount *mnt, struct dentry *dentry,
+					struct kstat *stat)
+{
+	struct inode *inode = dentry->d_inode;
+	int retval;
+	uid_t uid;
+	gid_t gid;
+
+	retval = security_inode_getattr(mnt, dentry);
+	if (retval)
+		return retval;
+
+	generic_fillattr(inode, stat);
+
+	retval = s_convert_uid_gid(inode, current->user->user_ns, &uid, &gid);
+	if (retval == 1) {
+		stat->uid = uid;
+		stat->gid = gid;
+	}
+	return 0;
+}
 
 /*
  * How many blocks doth make a writepage()?
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index e5be4bc..fe7350b 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -2410,6 +2410,8 @@ end_rename:
 	return retval;
 }
 
+extern int ext3_getattr(struct vfsmount *mnt, struct dentry *dentry,
+					struct kstat *stat);
 /*
  * directories can handle most operations...
  */
@@ -2431,6 +2433,7 @@ const struct inode_operations ext3_dir_inode_operations = {
 	.removexattr	= generic_removexattr,
 #endif
 	.permission	= ext3_permission,
+	.getattr	= ext3_getattr,
 };
 
 const struct inode_operations ext3_special_inode_operations = {
diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c
index 500fec7..cf7dc63 100644
--- a/fs/ext3/xattr.c
+++ b/fs/ext3/xattr.c
@@ -345,6 +345,7 @@ ext3_xattr_get_userns(struct inode *inode, void *value, size_t value_size)
 		error = ext3_xattr_block_get(inode, name_index, name, value,
 					     value_size);
 	up_read(&EXT3_I(inode)->xattr_sem);
+	printk(KERN_NOTICE "%s: returning %d for %lu\n", __func__, error, inode->i_ino);
 	return error;
 }
 
@@ -1102,7 +1103,12 @@ retry:
 		error = PTR_ERR(handle);
 	} else {
 		int error2;
+		char *buf;
+		int i;
 
+		printk(KERN_NOTICE "%s: writing %d bytes:\n", __func__, value_len);
+		for (i=0, buf = (char *)value; i<value_len; i++,buf++)
+			printk(KERN_NOTICE "%s: %d %x\n", __func__, i, (int)*buf);
 		error = ext3_xattr_set_handle(handle, inode, name_index, name,
 					      value, value_len, 0);
 		error2 = ext3_journal_stop(handle);
diff --git a/lib/fsuserns.c b/lib/fsuserns.c
index ac0ca99..f0be780 100644
--- a/lib/fsuserns.c
+++ b/lib/fsuserns.c
@@ -179,7 +179,7 @@ int fsuserns_convert_uid_gid(struct user_namespace *ns, struct inode *inode,
 	struct fsuserns_conversion_table *t;
 	struct fsuserns_table_entries *ep;
 	size_t valuelen;
-	struct unsstore *unsstore;
+	struct unsstore *unsstore = NULL;
 	int i, ret;
 
 	t = find_table(sb);
@@ -196,23 +196,33 @@ int fsuserns_convert_uid_gid(struct user_namespace *ns, struct inode *inode,
 convert:
 	mutex_unlock(&fsuserns_table_mutex);
 
-	/* look for an xattr */
-	/* yes, 3 needs to be made adjustable */
-	valuelen = 3;
-	unsstore = kzalloc(3*sizeof(struct unsstore), GFP_KERNEL);
+	ret = xattrget(inode, NULL, 0);
+	if (ret <= 0)
+		goto notfound;
+	valuelen = ret;
+	unsstore = kzalloc(ret, GFP_KERNEL);
 	ret = xattrget(inode, unsstore, valuelen);
-	if (ret < 0)
-		return ret;
-	for (i=0; i<3; i++)
+	if (ret <= 0)
+		goto notfound;
+	for (i=0; i<(ret/sizeof(*unsstore)); i++) {
+		printk(KERN_NOTICE "%s: comparing unstore id %d to userns id %d\n",
+			__func__, unsstore[i].ns, ep->userns_id);
 		if (unsstore[i].ns == ep->userns_id)
-			break;
-	if (i==3)
-		goto out;
+			goto found;
+	}
+	goto notfound;
+
+found:
 	*retuid = unsstore[i].uid;
 	*retgid = unsstore[i].gid;
+	printk(KERN_NOTICE "%s: found a uid (%d) for nsid %d\n",
+			__func__, *retuid, ep->userns_id);
+	kfree(unsstore);
 	return 1;
 
-out:
+notfound:
+	kfree(unsstore);
+	printk(KERN_NOTICE "%s: no uid for my ns found\n", __func__);
 	/* The following is BAD CODE.  IT's for testing only */
 	if (current->uid == 0) {
 		if (inode->i_uid == ns->creator->uid) {
@@ -291,6 +301,12 @@ found:
 
 }
 
+/*
+ * Let's say uid 500 in the init_user_ns created (nsid=3), and uid 400
+ * there created (nsid=5).  Then root in nsid=5 creates a file.
+ * We want to store 500 as the inode->iuid.  In xattr security.userns
+ * we store (3,400) and (5,0)
+ */
 int fsuserns_store_creds(struct inode *inode, struct user_struct *user,
 	int (*xattrset)(struct inode *inode, const void *value, size_t value_len))
 {
@@ -327,7 +343,7 @@ int fsuserns_store_creds(struct inode *inode, struct user_struct *user,
 			unsstore[i].gid = current->gid;
 		else
 			unsstore[i].gid = lastns->creator_grp;
-		printk(KERN_NOTICE "%s: setting xattr with ns=%d,uid=%d\n", __func__,
+		printk(KERN_NOTICE "%s: setting xattr with ns=%d,uid=%d,gid=%d\n", __func__,
 				unsstore[i].ns, unsstore[i].uid, unsstore[i].gid);
 		user = ns->creator;
 		lastns = ns;
-- 
1.5.4.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 04/10] user namespaces: enforce user namespaces for file permission
       [not found]     ` <20080822194609.GD10360-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-08-22 20:13       ` Eric W. Biederman
       [not found]         ` <m1ej4glsen.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  2008-08-22 21:13       ` Eric W. Biederman
  1 sibling, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2008-08-22 20:13 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Add a user_ns to the sb.  It is always set to the user_ns of the task which
> mounted the sb.
>
> Define 3 new super_operations.  convert_uid() and convert_gid() take a uid
> or gid from an inode on the sb's fs, and attempt to convert them into ids
> meaningful in the user namespace passed in, which presumably is current's.
> is_capable() checks whether current has the requested capability with respect
> to the sb's fs, which is dependent upon the relationship between current's
> user_ns and those which the sb knows about.
>
> If the sb isn't user ns aware - which none are right now - then the new
> super_operations won't be defined.  If in that case current and sb have
> different user namespaces, then the userids can't be compared.
>
> If current's and sb's userids can't be compared, then current will get
> 'user other' (we usually say 'nobody') permissions to the inode.
>
> Changelog:
> 	Aug 5: send the whole inode to the super_operations.
>
> Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/namei.c         |   29 +++++++++++++++++++++++---
>  fs/super.c         |    3 ++
>  include/linux/fs.h |   55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 83 insertions(+), 4 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 4ea63ed..6bf5446 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -183,10 +183,27 @@ int generic_permission(struct inode *inode, int mask,
>  		int (*check_acl)(struct inode *inode, int mask))
>  {
>  	umode_t			mode = inode->i_mode;
> +	uid_t			iuid;
> +	gid_t			igid;
> +	int			ret;
> +	int			good_userns = 1;
> +
> +	ret = s_convert_uid_gid(inode, current->user->user_ns,
> +						&iuid, &igid);
> +	if (!ret)
> +		good_userns = 0;
>  
>  	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
>  
> -	if (current->fsuid == inode->i_uid)
> +	/*
> +	 * If we're not in the inode's user namespace, we get
> +	 * user nobody permissions, and we ignore acls
> +	 * (bc serge doesn't know how to handle acls in this case)
> +	 */
> +	if (!good_userns)
> +		goto check;
> +
> +	if (current->fsuid == iuid)
>  		mode >>= 6;
>  	else {
>  		if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) {
> @@ -197,15 +214,18 @@ int generic_permission(struct inode *inode, int mask,
>  				return error;
>  		}
>  
> -		if (in_group_p(inode->i_gid))
> +		if (in_group_p(igid))
>  			mode >>= 3;
>  	}
>  
> +check:
>  	/*
>  	 * If the DACs are ok we don't need any capability check.
>  	 */
>  	if ((mask & ~mode) == 0)
>  		return 0;
> +	if (!good_userns)
> +		return -EACCES;
>  
>   check_capabilities:
>  	/*
> @@ -214,14 +234,15 @@ int generic_permission(struct inode *inode, int mask,
>  	 */
>  	if (!(mask & MAY_EXEC) ||
>  	    (inode->i_mode & S_IXUGO) || S_ISDIR(inode->i_mode))
> -		if (capable(CAP_DAC_OVERRIDE))
> +		if (s_is_capable(inode, current->user->user_ns,
> +						CAP_DAC_OVERRIDE))
>  			return 0;
>  
>  	/*
>  	 * Searching includes executable on directories, else just read.
>  	 */
>  	if (mask == MAY_READ || (S_ISDIR(inode->i_mode) && !(mask & MAY_WRITE)))
> -		if (capable(CAP_DAC_READ_SEARCH))
> + if (s_is_capable(inode, current->user->user_ns, CAP_DAC_READ_SEARCH))
>  			return 0;
>  
>  	return -EACCES;
> diff --git a/fs/super.c b/fs/super.c
> index e931ae9..3559637 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -38,6 +38,7 @@
>  #include <linux/kobject.h>
>  #include <linux/mutex.h>
>  #include <linux/file.h>
> +#include <linux/user_namespace.h>
>  #include <asm/uaccess.h>
>  #include "internal.h"
>  
> @@ -93,6 +94,7 @@ static struct super_block *alloc_super(struct file_system_type
> *type)
>  		s->s_qcop = sb_quotactl_ops;
>  		s->s_op = &default_op;
>  		s->s_time_gran = 1000000000;
> +		s->user_ns = get_user_ns(current->user->user_ns);
>  	}
>  out:
>  	return s;
> @@ -109,6 +111,7 @@ static inline void destroy_super(struct super_block *s)
>  	security_sb_free(s);
>  	kfree(s->s_subtype);
>  	kfree(s->s_options);
> +	put_user_ns(s->user_ns);
>  	kfree(s);
>  }
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 580b513..bb58c2e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1128,6 +1128,11 @@ struct super_block {
>  	 * generic_show_options()
>  	 */
>  	char *s_options;
> +
> +	/*
> +	 * namespace of the user which mounted the sb
> +	 */
> +	struct user_namespace *user_ns;

We should make it clear that this is a default, and not something
a filesystem must use, just something a filesystem may use.
>  };
>  
>  extern struct timespec current_fs_time(struct super_block *sb);
> @@ -1320,6 +1325,9 @@ struct super_operations {
>  	int (*remount_fs) (struct super_block *, int *, char *);
>  	void (*clear_inode) (struct inode *);
>  	void (*umount_begin) (struct super_block *);
> +	int (*is_capable) (struct user_namespace *, struct inode *, int);
> +	uid_t (*convert_uid_gid)(struct user_namespace *, struct inode *,
> +						uid_t *, gid_t *);
>  
I'm not convinced either of those methods makes sense.
>  	int (*show_options)(struct seq_file *, struct vfsmount *);
>  	int (*show_stats)(struct seq_file *, struct vfsmount *);
> @@ -1330,6 +1338,53 @@ struct super_operations {
>  };
>  
>  /*
> + * In the next 3, i'm passing the user_ns in so that I don't need
> + * to know how to dereference struct user her (which would require
> + * #including sched.h).
> + * Note in particular that for instance in s_convert_uid, the uid is the
> + * inode->i_uid, while user_ns is current->user->user_ns.
> + */
> +
> +/*
> + * s_convert_uid: attempt to translate the inode's stored
> + * uid to a uid meaningful in user_ns passed in
> + * If possible, store the result in reuid and return 1
> + * Otherwise, return 0, meaningful the uid cannot be translated
> + */
> +static inline int s_convert_uid_gid(struct inode *ino,
> +		struct user_namespace *user_ns, uid_t *retuid, gid_t *retgid)
> +{
> +	struct super_block *sb = ino->i_sb;
> +
> +	if (sb->user_ns == user_ns) {
> +		*retuid = ino->i_uid;
> +		*retgid = ino->i_gid;
> +		return 1;
> +	}
> +	if (sb->s_op->convert_uid_gid)
> +		return sb->s_op->convert_uid_gid(user_ns, ino, retuid, retgid);

We should be able to just use the existing inode->i_op->getattr.
Especially as that method will have to be updated anyway..

> +	return 0;
> +}
> +
> +/*
> + * s_is_capable: check whether current is capable with respect
> + * to the filesystem represented by sb.
> + *
> + * return 0 if false, 1 if true
> + */
???  Capable should be with respect to a user namespace not with
respect to a filesystem.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [0/10] User namespaces: introduction
       [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2008-08-22 19:47   ` [PATCH 10/10] userns: add support for readdir Serge E. Hallyn
@ 2008-08-22 20:41   ` Eric W. Biederman
       [not found]     ` <m1d4k0ixzp.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  10 siblings, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2008-08-22 20:41 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Hi Eric,
>
> so here is a start to a userns patchset trying to follow your ideas
> about how to have user namespaces and filesystems interact.  Ignore
> the bookkeeping crap or you'll pull your hair out.  Lots of stuff
> remains unimplemented - i.e. chown (setattr) and proper handling of
> capabilities.  But you can do some fun things with this patchset.
> I.e.
>
> 	(log in as root)
> 	setcap cap_sys_admin=ep ns_exec
> 	setcap cap_sys_admin=ep usernsmount
> 	ns_exec -U /bin/sh
> 	ls /root (fails)
> 	ls / (succeeds)
> 	(log in as hallyn)
> 	ns_exec -U /bin/sh
> 	id
> 		(uid=0, gid=0)
> 	ls (fails, can't descend /home/hallyn)
> 	usernsmount / nsid=4
> 	ls (succeeds)
> 	touch ab
> 	ls -l ab
> 		(ab is owned by root)
> 	exit
> 	(we're logged in as hallyn in the init_user_ns again)
> 	ls -l ab
> 		(ab is owned by hallyn)
>
> The only supported fs is ext3.  Only a few operations are supported.
> So if, above, when we are hallyn in the init_user_ns but root in
> the child user ns,
> 	when we create a file, it is properly handled, so
> 		inode->i_uid=500, but an xattr (nsid=4,uid=0) is added
> 	when we chown the file to root, it is not properly handled,
> 		so inode->i_uid = 0
> it's just a matter of hooking all the places at this point.
>
> Capabilities remain a problem.  Right now I think capabilities will
> need to be split up into system-wide caps, and container-safe caps.
> So CAP_NET_ADMIN, CAP_NET_RAW, CAP_DAC_OVERRIDE, those are container-safe.
> CAP_REBOOT may become container-safe one day, but for now is very
> much system-wide.
>
> So if I'm uid 500 on the host and create a user namespace where I'm
> uid=0, I should be able to acquire container-safe caps (perhaps
> contingent on whether I unshared all other namespaces), but not
> system-wide ones.  Or, whether I can acquire them would depend
> on whether the suid bit was set in a user_ns or not.  sigh.

Serge at first glance this looks like a good start, especially for thinking
through how things will work.

It has just occurred to me that from a dependency point of view it
makes an enormous amount of sense to sort out capable with
respect to namespaces before we get to the filesystems.

There is no one else working in the area of capabilities so there won't
be conflicts, and we need a firm understanding of how capabilities are
going to work with respect to namespaces before we start embedding
the logic in filesystems.

With respect to your separation of capabilities in namespaces I don't think
you have quite grasped the simple idea that is sitting in my head and makes
all of this clear.  Let me see if I can explain it better.

A fully qualified capability name would be of the form:
userns:capability_name

For each operation we will check for one specific capability.
For the network namespace in particular we will check for:
userns_of_network_namespace_creator:CAP_NET_ADMIN

The check for a capability will succeed if:
- We have the exact fully qualified capability. 
- We are outside the user namespace but are the owner of
  the user namespace.
- We are outside the user namespace but have the appropriate
  capability over the owner of the user namespace  CAP_PTRACE?

  This last test would recurses.

I'm less certain than I like about which permissions we allow someone outside
of a container to posses and still control the container.

This has two very useful implications.
- We can have all capabilities in a new user namespace and be completely
  impotent.
- Allowing the capabilities of a user namespace to do something useful
  can come gradually.

Which means we need to extend the classic capable check to become.
capable(userns, capability).  Or possibly we extend the capability
parameter to be a structure that can hold both userns and the capability,
whichever turns out to be more maintainable.

Once we have done that we can allow something to be under the power
of creator_user_ns:capability instead of init_user_ns:capability.

So the CAP_SYS_REBOOT test will be init_user_ns:capability for the 
foreseeable future.  While the CAP_NET_ADMIN test will shortly
become creator_of_netns:CAP_NET_ADMIN.

Of course none of that will happen until we relax the test to create a
new namespace from init_user_ns:CAP_SYS_ADMIN to
current_user_ns:CAP_SYS_ADMIN.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 04/10] user namespaces: enforce user namespaces for file permission
       [not found]     ` <20080822194609.GD10360-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-08-22 20:13       ` Eric W. Biederman
@ 2008-08-22 21:13       ` Eric W. Biederman
       [not found]         ` <m1bpzkhhy0.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  1 sibling, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2008-08-22 21:13 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Add a user_ns to the sb.  It is always set to the user_ns of the task which
> mounted the sb.
>
> Define 3 new super_operations.  convert_uid() and convert_gid() take a uid
> or gid from an inode on the sb's fs, and attempt to convert them into ids
> meaningful in the user namespace passed in, which presumably is current's.
> is_capable() checks whether current has the requested capability with respect
> to the sb's fs, which is dependent upon the relationship between current's
> user_ns and those which the sb knows about.
>
> If the sb isn't user ns aware - which none are right now - then the new
> super_operations won't be defined.  If in that case current and sb have
> different user namespaces, then the userids can't be compared.
>
> If current's and sb's userids can't be compared, then current will get
> 'user other' (we usually say 'nobody') permissions to the inode.
>
> Changelog:
> 	Aug 5: send the whole inode to the super_operations.

Serge while I am dubious about adding the methods you have added to
super_block_operations there is one very important question you are
using them for that we do need to ask.

Does this filesystem map to user namespace X.

Can we separate that concept out as it's own individual function and initially
have a noop implementation that returns false for everything except for
the initial network namespace.

By doing just that we should be able to capture and update all of the tests without
the rest of the hassle.  Possibly this is just your s_convert_uid_gid function
without calling the superblock option.

I would really like to get to the point where we can create a user namespace
without privilege and have it be safe (if impotent).  Once we do that we can
extend out what the root of a user namespace is allowed to do, much as
we have done with the network namespace. 

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re:  [PATCH 04/10] user namespaces: enforce usernamespaces for file permission
       [not found]         ` <m1bpzkhhy0.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
@ 2008-08-23  0:53           ` Serge E. Hallyn
       [not found]             ` <20080823005304.GA21064-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-23  0:53 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Add a user_ns to the sb.  It is always set to the user_ns of the task which
> > mounted the sb.
> >
> > Define 3 new super_operations.  convert_uid() and convert_gid() take a uid
> > or gid from an inode on the sb's fs, and attempt to convert them into ids
> > meaningful in the user namespace passed in, which presumably is current's.
> > is_capable() checks whether current has the requested capability with respect
> > to the sb's fs, which is dependent upon the relationship between current's
> > user_ns and those which the sb knows about.
> >
> > If the sb isn't user ns aware - which none are right now - then the new
> > super_operations won't be defined.  If in that case current and sb have
> > different user namespaces, then the userids can't be compared.
> >
> > If current's and sb's userids can't be compared, then current will get
> > 'user other' (we usually say 'nobody') permissions to the inode.
> >
> > Changelog:
> > 	Aug 5: send the whole inode to the super_operations.
> 
> Serge while I am dubious about adding the methods you have added to
> super_block_operations there is one very important question you are
> using them for that we do need to ask.
> 
> Does this filesystem map to user namespace X.

By itself that is not sufficient.  We need to support two inodes on the
same fs where both have i_uid=500 on the host fs, while in user
namespace X one is owned by uid 0, and another by uid 1000.

So we need to be able to pass the filesystem an inode and a user
namespace, and ask for the owning uid and gids.

Or am I (I likely am) misunderstanding?

-serge

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 04/10] user namespaces: enforce user namespaces for file permission
       [not found]         ` <m1ej4glsen.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
@ 2008-08-23  0:57           ` Serge E. Hallyn
       [not found]             ` <20080823005715.GB21064-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-23  0:57 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Add a user_ns to the sb.  It is always set to the user_ns of the task which
> > mounted the sb.
> >
> > Define 3 new super_operations.  convert_uid() and convert_gid() take a uid
> > or gid from an inode on the sb's fs, and attempt to convert them into ids
> > meaningful in the user namespace passed in, which presumably is current's.
> > is_capable() checks whether current has the requested capability with respect
> > to the sb's fs, which is dependent upon the relationship between current's
> > user_ns and those which the sb knows about.
> >
> > If the sb isn't user ns aware - which none are right now - then the new
> > super_operations won't be defined.  If in that case current and sb have
> > different user namespaces, then the userids can't be compared.
> >
> > If current's and sb's userids can't be compared, then current will get
> > 'user other' (we usually say 'nobody') permissions to the inode.
> >
> > Changelog:
> > 	Aug 5: send the whole inode to the super_operations.
> >
> > Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> > ---
> >  fs/namei.c         |   29 +++++++++++++++++++++++---
> >  fs/super.c         |    3 ++
> >  include/linux/fs.h |   55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 83 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 4ea63ed..6bf5446 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -183,10 +183,27 @@ int generic_permission(struct inode *inode, int mask,
> >  		int (*check_acl)(struct inode *inode, int mask))
> >  {
> >  	umode_t			mode = inode->i_mode;
> > +	uid_t			iuid;
> > +	gid_t			igid;
> > +	int			ret;
> > +	int			good_userns = 1;
> > +
> > +	ret = s_convert_uid_gid(inode, current->user->user_ns,
> > +						&iuid, &igid);
> > +	if (!ret)
> > +		good_userns = 0;
> >  
> >  	mask &= MAY_READ | MAY_WRITE | MAY_EXEC;
> >  
> > -	if (current->fsuid == inode->i_uid)
> > +	/*
> > +	 * If we're not in the inode's user namespace, we get
> > +	 * user nobody permissions, and we ignore acls
> > +	 * (bc serge doesn't know how to handle acls in this case)
> > +	 */
> > +	if (!good_userns)
> > +		goto check;
> > +
> > +	if (current->fsuid == iuid)
> >  		mode >>= 6;
> >  	else {
> >  		if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) {
> > @@ -197,15 +214,18 @@ int generic_permission(struct inode *inode, int mask,
> >  				return error;
> >  		}
> >  
> > -		if (in_group_p(inode->i_gid))
> > +		if (in_group_p(igid))
> >  			mode >>= 3;
> >  	}
> >  
> > +check:
> >  	/*
> >  	 * If the DACs are ok we don't need any capability check.
> >  	 */
> >  	if ((mask & ~mode) == 0)
> >  		return 0;
> > +	if (!good_userns)
> > +		return -EACCES;
> >  
> >   check_capabilities:
> >  	/*
> > @@ -214,14 +234,15 @@ int generic_permission(struct inode *inode, int mask,
> >  	 */
> >  	if (!(mask & MAY_EXEC) ||
> >  	    (inode->i_mode & S_IXUGO) || S_ISDIR(inode->i_mode))
> > -		if (capable(CAP_DAC_OVERRIDE))
> > +		if (s_is_capable(inode, current->user->user_ns,
> > +						CAP_DAC_OVERRIDE))
> >  			return 0;
> >  
> >  	/*
> >  	 * Searching includes executable on directories, else just read.
> >  	 */
> >  	if (mask == MAY_READ || (S_ISDIR(inode->i_mode) && !(mask & MAY_WRITE)))
> > -		if (capable(CAP_DAC_READ_SEARCH))
> > + if (s_is_capable(inode, current->user->user_ns, CAP_DAC_READ_SEARCH))
> >  			return 0;
> >  
> >  	return -EACCES;
> > diff --git a/fs/super.c b/fs/super.c
> > index e931ae9..3559637 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -38,6 +38,7 @@
> >  #include <linux/kobject.h>
> >  #include <linux/mutex.h>
> >  #include <linux/file.h>
> > +#include <linux/user_namespace.h>
> >  #include <asm/uaccess.h>
> >  #include "internal.h"
> >  
> > @@ -93,6 +94,7 @@ static struct super_block *alloc_super(struct file_system_type
> > *type)
> >  		s->s_qcop = sb_quotactl_ops;
> >  		s->s_op = &default_op;
> >  		s->s_time_gran = 1000000000;
> > +		s->user_ns = get_user_ns(current->user->user_ns);
> >  	}
> >  out:
> >  	return s;
> > @@ -109,6 +111,7 @@ static inline void destroy_super(struct super_block *s)
> >  	security_sb_free(s);
> >  	kfree(s->s_subtype);
> >  	kfree(s->s_options);
> > +	put_user_ns(s->user_ns);
> >  	kfree(s);
> >  }
> >  
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 580b513..bb58c2e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1128,6 +1128,11 @@ struct super_block {
> >  	 * generic_show_options()
> >  	 */
> >  	char *s_options;
> > +
> > +	/*
> > +	 * namespace of the user which mounted the sb
> > +	 */
> > +	struct user_namespace *user_ns;
> 
> We should make it clear that this is a default, and not something
> a filesystem must use, just something a filesystem may use.
> >  };
> >  
> >  extern struct timespec current_fs_time(struct super_block *sb);
> > @@ -1320,6 +1325,9 @@ struct super_operations {
> >  	int (*remount_fs) (struct super_block *, int *, char *);
> >  	void (*clear_inode) (struct inode *);
> >  	void (*umount_begin) (struct super_block *);
> > +	int (*is_capable) (struct user_namespace *, struct inode *, int);
> > +	uid_t (*convert_uid_gid)(struct user_namespace *, struct inode *,
> > +						uid_t *, gid_t *);
> >  
> I'm not convinced either of those methods makes sense.
> >  	int (*show_options)(struct seq_file *, struct vfsmount *);
> >  	int (*show_stats)(struct seq_file *, struct vfsmount *);
> > @@ -1330,6 +1338,53 @@ struct super_operations {
> >  };
> >  
> >  /*
> > + * In the next 3, i'm passing the user_ns in so that I don't need
> > + * to know how to dereference struct user her (which would require
> > + * #including sched.h).
> > + * Note in particular that for instance in s_convert_uid, the uid is the
> > + * inode->i_uid, while user_ns is current->user->user_ns.
> > + */
> > +
> > +/*
> > + * s_convert_uid: attempt to translate the inode's stored
> > + * uid to a uid meaningful in user_ns passed in
> > + * If possible, store the result in reuid and return 1
> > + * Otherwise, return 0, meaningful the uid cannot be translated
> > + */
> > +static inline int s_convert_uid_gid(struct inode *ino,
> > +		struct user_namespace *user_ns, uid_t *retuid, gid_t *retgid)
> > +{
> > +	struct super_block *sb = ino->i_sb;
> > +
> > +	if (sb->user_ns == user_ns) {
> > +		*retuid = ino->i_uid;
> > +		*retgid = ino->i_gid;
> > +		return 1;
> > +	}
> > +	if (sb->s_op->convert_uid_gid)
> > +		return sb->s_op->convert_uid_gid(user_ns, ino, retuid, retgid);
> 
> We should be able to just use the existing inode->i_op->getattr.
> Especially as that method will have to be updated anyway..

That might make sense.  I guess the problem is I started trying to
handle generic permission.  But I don't need to...  the fs can
provide its own permission, else we do the simple userns check.

> > +	return 0;
> > +}
> > +
> > +/*
> > + * s_is_capable: check whether current is capable with respect
> > + * to the filesystem represented by sb.
> > + *
> > + * return 0 if false, 1 if true
> > + */
> ???  Capable should be with respect to a user namespace not with
> respect to a filesystem.

The user ns comes from the current task.  The fs must decide whether
current->user->uid in current->user->user_ns has capabilities over
the target file.  Only the fs can erify that the user is really
uid=0 or has CAP_DAC_OVERRIDE, for instance, with respect to the file.

-serge

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [0/10] User namespaces: introduction
       [not found]     ` <m1d4k0ixzp.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
@ 2008-08-23  1:17       ` Serge E. Hallyn
       [not found]         ` <20080823011731.GA22737-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-23  1:17 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Hi Eric,
> >
> > so here is a start to a userns patchset trying to follow your ideas
> > about how to have user namespaces and filesystems interact.  Ignore
> > the bookkeeping crap or you'll pull your hair out.  Lots of stuff
> > remains unimplemented - i.e. chown (setattr) and proper handling of
> > capabilities.  But you can do some fun things with this patchset.
> > I.e.
> >
> > 	(log in as root)
> > 	setcap cap_sys_admin=ep ns_exec
> > 	setcap cap_sys_admin=ep usernsmount
> > 	ns_exec -U /bin/sh
> > 	ls /root (fails)
> > 	ls / (succeeds)
> > 	(log in as hallyn)
> > 	ns_exec -U /bin/sh
> > 	id
> > 		(uid=0, gid=0)
> > 	ls (fails, can't descend /home/hallyn)
> > 	usernsmount / nsid=4
> > 	ls (succeeds)
> > 	touch ab
> > 	ls -l ab
> > 		(ab is owned by root)
> > 	exit
> > 	(we're logged in as hallyn in the init_user_ns again)
> > 	ls -l ab
> > 		(ab is owned by hallyn)
> >
> > The only supported fs is ext3.  Only a few operations are supported.
> > So if, above, when we are hallyn in the init_user_ns but root in
> > the child user ns,
> > 	when we create a file, it is properly handled, so
> > 		inode->i_uid=500, but an xattr (nsid=4,uid=0) is added
> > 	when we chown the file to root, it is not properly handled,
> > 		so inode->i_uid = 0
> > it's just a matter of hooking all the places at this point.
> >
> > Capabilities remain a problem.  Right now I think capabilities will
> > need to be split up into system-wide caps, and container-safe caps.
> > So CAP_NET_ADMIN, CAP_NET_RAW, CAP_DAC_OVERRIDE, those are container-safe.
> > CAP_REBOOT may become container-safe one day, but for now is very
> > much system-wide.
> >
> > So if I'm uid 500 on the host and create a user namespace where I'm
> > uid=0, I should be able to acquire container-safe caps (perhaps
> > contingent on whether I unshared all other namespaces), but not
> > system-wide ones.  Or, whether I can acquire them would depend
> > on whether the suid bit was set in a user_ns or not.  sigh.
> 
> Serge at first glance this looks like a good start, especially for thinking
> through how things will work.
> 
> It has just occurred to me that from a dependency point of view it
> makes an enormous amount of sense to sort out capable with
> respect to namespaces before we get to the filesystems.
> 
> There is no one else working in the area of capabilities so there won't
> be conflicts, and we need a firm understanding of how capabilities are
> going to work with respect to namespaces before we start embedding
> the logic in filesystems.
> 
> With respect to your separation of capabilities in namespaces I don't think
> you have quite grasped the simple idea that is sitting in my head and makes
> all of this clear.  Let me see if I can explain it better.
> 
> A fully qualified capability name would be of the form:
> userns:capability_name
> 
> For each operation we will check for one specific capability.
> For the network namespace in particular we will check for:
> userns_of_network_namespace_creator:CAP_NET_ADMIN
> 
> The check for a capability will succeed if:
> - We have the exact fully qualified capability. 
> - We are outside the user namespace but are the owner of
>   the user namespace.
> - We are outside the user namespace but have the appropriate
>   capability over the owner of the user namespace  CAP_PTRACE?
> 
>   This last test would recurses.
> 
> I'm less certain than I like about which permissions we allow someone outside
> of a container to posses and still control the container.
> 
> This has two very useful implications.
> - We can have all capabilities in a new user namespace and be completely
>   impotent.
> - Allowing the capabilities of a user namespace to do something useful
>   can come gradually.
> 
> Which means we need to extend the classic capable check to become.
> capable(userns, capability).  Or possibly we extend the capability
> parameter to be a structure that can hold both userns and the capability,
> whichever turns out to be more maintainable.
> 
> Once we have done that we can allow something to be under the power
> of creator_user_ns:capability instead of init_user_ns:capability.
> 
> So the CAP_SYS_REBOOT test will be init_user_ns:capability for the 
> foreseeable future.  While the CAP_NET_ADMIN test will shortly
> become creator_of_netns:CAP_NET_ADMIN.
> 
> Of course none of that will happen until we relax the test to create a
> new namespace from init_user_ns:CAP_SYS_ADMIN to
> current_user_ns:CAP_SYS_ADMIN.
> 
> Eric

It definately seems to make sense in terms of the security
implications.  And solving this before the filesystem handlers seems
to make sense too.  Although I would like to get the first 3 patches upstream
pretty soon, as I believe they are proper fixes.

But wrt userns:capability, the problem that brings to mind is that of
referring to the userns.  Do we use the userspace-exported id, or do we
use the actual in-kernel user_ns?  If we use the in-kernel user_ns,
then we'd have to take a ref for each cap, yuck.  But you had wanted to
use 'mount' to only have filesystems associate userspace ids with the
in-kernel struct user_ns, so that complicates the idea of having
capabilities refer to those.

Anyway I like the overall approach, and will think a bit about
any other actual implementation issues.

thanks,
-serge

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 04/10] user namespaces: enforce usernamespaces for file permission
       [not found]             ` <20080823005304.GA21064-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-08-23  1:56               ` Eric W. Biederman
       [not found]                 ` <m1r68gebop.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2008-08-23  1:56 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>
> By itself that is not sufficient.  We need to support two inodes on the
> same fs where both have i_uid=500 on the host fs, while in user
> namespace X one is owned by uid 0, and another by uid 1000.
>
> So we need to be able to pass the filesystem an inode and a user
> namespace, and ask for the owning uid and gids.
>
> Or am I (I likely am) misunderstanding?

There are two questions.
Does this filesystem provide mappings to user namespace X?
What is the mapping from this filesystem to user namespace X?

I think we may be able to separate those two questions.

The important idea is that we don't need to implement filesystem changes
in the first pass.  Just have the permission check fail unconditionally
if we are not in the init_user_ns.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 04/10] user namespaces: enforce user namespaces for file permission
       [not found]             ` <20080823005715.GB21064-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-08-23  2:16               ` Eric W. Biederman
  0 siblings, 0 replies; 24+ messages in thread
From: Eric W. Biederman @ 2008-08-23  2:16 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> That might make sense.  I guess the problem is I started trying to
> handle generic permission.  But I don't need to...  the fs can
> provide its own permission, else we do the simple userns check.

Sounds right.

We really need that simple starting place where we can turn it all
on and you aren't allowed to do anything, then we can enable one
feature at a time from there.

Plus I can think of some uses for a user who isn't allowed to do
anything on a machine. ;)

> The user ns comes from the current task.  The fs must decide whether
> current->user->uid in current->user->user_ns has capabilities over
> the target file.  Only the fs can erify that the user is really
> uid=0 or has CAP_DAC_OVERRIDE, for instance, with respect to the file.

In the general case only the filesystem can make the call about what
powers we have over a specific file and the filesystem will make that
call when we pass it our kerberos ticket to the filesystem server, and
ask the filesystem to perform a specific operation.

In the common case the filesystem is just a reflection of our in memory
set of namespaces.  Getting the capability rules straight are going
to be a little interesting.  Essentially however for filesystems
I believe the rule is if I have CAP_DAC_OVERRIDE in the user namespace
of a user.  And that user creates a namespace.  Then I should have
CAP_DAC_OVERRIDE capability over that namespace as well.  Allowing
me to back up files in the unprivileged users namespace.

The common case is the only case that needs to be implemented
in generic kernel code.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 04/10] user namespaces: enforce usernamespaces for file permission
       [not found]                 ` <m1r68gebop.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
@ 2008-08-23  2:22                   ` Serge E. Hallyn
       [not found]                     ` <20080823022210.GA29618-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-23  2:22 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> >
> > By itself that is not sufficient.  We need to support two inodes on the
> > same fs where both have i_uid=500 on the host fs, while in user
> > namespace X one is owned by uid 0, and another by uid 1000.
> >
> > So we need to be able to pass the filesystem an inode and a user
> > namespace, and ask for the owning uid and gids.
> >
> > Or am I (I likely am) misunderstanding?
> 
> There are two questions.
> Does this filesystem provide mappings to user namespace X?
> What is the mapping from this filesystem to user namespace X?

That is where you and I still disagree:  I don't believe that a mapping
as such makes sense.

A mapping implies "uid 500 becomes uid 0, uid 501 becomes uid 1000",
etc.  That simply is not sufficient.  If I am going to be able to create
a new userns without privilege, then all files belonging to all uids in
the child userns must be owned by uid 500 in my parent uidns.

So I do think we will have to have something more of the format "what is
the owning uid and gid of this inode, in the context of user namespace X."
Which was the s_convert_uid_gid() in my helpers.

If we follow your userns:capability idea and then we probably can get
rid of the s_is_capable you hated so much :)

But,

> I think we may be able to separate those two questions.
> 
> The important idea is that we don't need to implement filesystem changes
> in the first pass.  Just have the permission check fail unconditionally
> if we are not in the init_user_ns.
> 
> Eric

There I definately agree.

However, given this general direction, do you feel that marking an sb
with a userns the way I'm doing in patch 4/10 makes sense, using it
only to facilitate a basic "is this non-userns-aware fs in the same
userns as the task attempting to access it" check?  If not, what
alternatives do we have, given that marking vfsmounts as we've
discussed before won't be feasible?

Of course by the time we get capabilities straightened out I
suppose tagging vfsmounts might in fact be feasible  :-)

Thanks, Eric.

-serge

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [0/10] User namespaces: introduction
       [not found]         ` <20080823011731.GA22737-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-08-23  3:19           ` Eric W. Biederman
       [not found]             ` <m1sksw770k.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Eric W. Biederman @ 2008-08-23  3:19 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> It definately seems to make sense in terms of the security
> implications.  And solving this before the filesystem handlers seems
> to make sense too.  Although I would like to get the first 3 patches upstream
> pretty soon, as I believe they are proper fixes.

Reasonable.  I'm not certain about free_user continuing to be an inline
function as it seems a bit non-trivial, but otherwise that sounds correct.

> But wrt userns:capability, the problem that brings to mind is that of
> referring to the userns.  Do we use the userspace-exported id, or do we
> use the actual in-kernel user_ns?  If we use the in-kernel user_ns,
> then we'd have to take a ref for each cap, yuck.  But you had wanted to
> use 'mount' to only have filesystems associate userspace ids with the
> in-kernel struct user_ns, so that complicates the idea of having
> capabilities refer to those.

I don't think so.  In the standard security model there are only 2
intersections between the filesystem and the capabilities.

- CAP_DAC_OVERRIDE.
- The capabilities xattr on a filesystem.

With a filesystem in exactly one user namespace at a time this
is straight forward.  With a filesystem in user namespaces at
a time this is slightly more interesting.

I believe the authentication algorithm becomes:
Map the credentials on the filesystem inode into (fs_user_ns, fs_uid, fs_gid, fs_mode)

Then to see if we have power over the file we test:
capable(fs_user_ns, CAP_DAC_OVERRIDE).

Then if current->user->user_ns != fs_user_ns we can do something like:
uid = 0, gid = 0 mode clear except for the other bits.  We want either
0 or another uid we have reserved for the purpose.

I don't see why the mapping rules should not be universal so we can probably
do all of the mapping foreign uid's and gid's in generic code and just
place a unser_ns pointer into struct inode. 

Which makes things very close to how they are now and it means we can
do the lookup of the user_ns when we cache the struct inode.

> Anyway I like the overall approach, and will think a bit about
> any other actual implementation issues.

Thanks.

It adds more complications then I like not having a view of the filesystem
with a single user_namespace.  However that appears to be necessary to
deal with Al's inode_permission changes, and it seems to be where we
are ultimately heading so it seems more honest.  So I guess I have
to bite the bullet and accept it. ;)

For the case of a shared /usr just having other permission access
should work fairly well.  I just looked on my ubuntu system and I
found only 36 suid executables and only one executable (fusermount)
that was not world executable.  And a shared usr is the only reasonable
case I could think of where I would want a file to at least appear
to have multiple owners.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 04/10] user namespaces: enforce usernamespaces for file permission
       [not found]                     ` <20080823022210.GA29618-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-08-23  3:41                       ` Eric W. Biederman
  0 siblings, 0 replies; 24+ messages in thread
From: Eric W. Biederman @ 2008-08-23  3:41 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

>> There are two questions.
>> Does this filesystem provide mappings to user namespace X?
>> What is the mapping from this filesystem to user namespace X?
>
> That is where you and I still disagree:  I don't believe that a mapping
> as such makes sense.
>
> A mapping implies "uid 500 becomes uid 0, uid 501 becomes uid 1000",
> etc.  That simply is not sufficient.  If I am going to be able to create
> a new userns without privilege, then all files belonging to all uids in
> the child userns must be owned by uid 500 in my parent uidns.

> So I do think we will have to have something more of the format "what is
> the owning uid and gid of this inode, in the context of user namespace X."
> Which was the s_convert_uid_gid() in my helpers.

That is what I meant by mapping.  ;)

Given my thought in my last message that we should cache the user_ns
in the inode along with the rest of the credentials.  Making all of
the child files appear and act as if they are owned by uid 500 in the
parent namespace should be straight forward.

I think we may want a 1-1 mapping onto the filesystem.  However I
don't care very much how we store uids on the filesystem as long
as we do something that works.

> If we follow your userns:capability idea and then we probably can get
> rid of the s_is_capable you hated so much :)

yep.

> But,
>
>> I think we may be able to separate those two questions.
>> 
>> The important idea is that we don't need to implement filesystem changes
>> in the first pass.  Just have the permission check fail unconditionally
>> if we are not in the init_user_ns.
>> 
>> Eric
>
> There I definately agree.
>
> However, given this general direction, do you feel that marking an sb
> with a userns the way I'm doing in patch 4/10 makes sense, using it
> only to facilitate a basic "is this non-userns-aware fs in the same
> userns as the task attempting to access it" check?  If not, what
> alternatives do we have, given that marking vfsmounts as we've
> discussed before won't be feasible?
>
> Of course by the time we get capabilities straightened out I
> suppose tagging vfsmounts might in fact be feasible  :-)

LOL

Well I tell you what.  Unless we can think of a reason for a
file to live in more than one leaf user namespace.  We should
tag the inode with the leaf user namespace.

We could tag the superblock with a default for the inode user
namespace but that is less interesting.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [0/10] User namespaces: introduction
       [not found]             ` <m1sksw770k.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
@ 2008-08-25 19:51               ` Serge E. Hallyn
       [not found]                 ` <20080825195124.GA9361-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 24+ messages in thread
From: Serge E. Hallyn @ 2008-08-25 19:51 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Linux Containers

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> 
> > It definately seems to make sense in terms of the security
> > implications.  And solving this before the filesystem handlers seems
> > to make sense too.  Although I would like to get the first 3 patches upstream
> > pretty soon, as I believe they are proper fixes.
> 
> Reasonable.  I'm not certain about free_user continuing to be an inline
> function as it seems a bit non-trivial, but otherwise that sounds correct.

Great, I'll fix that and resend and ask for inclusion.

So based on your input, here is how I'm seeing the next iteration of
usernamespace-filesystem interaction semantics:

if
	0=init_user_ns
	(X,Y) = (userns X, uid Y)
	(0,500) creates (1,0) and (1,1000)
	(1,1000) creates a file /foo/bar
	then
		inode->i_uid = 1000
		inode->i_userns = 1  (we use the mount-provided userns, right?)
			i_userns storing is per-fs, but probably uses xattr)
		the fs stores the fact that (0,500) owns userns 1
			this might be stored just in /etc/userns.conf,
			and parsed at mount time)
	 when (1,1001) looks up /foo/bar, he sees owner=1000
	 when (0,501) looks up /foo/bar, he sees owner=500
	 when (0,501) creates (2,0) and (2,0) looksup /foo/bar,
		he sees owner=0, mode bits clear except the 'other' bits

Put user_ns in struct inode so simple userid mapping can be done
in generic code.

Here is a weirdness:  If (0,500) creates some files as (1,1000)
under /home/hallyn/containers/vs1.  Now the system is rebooted, and the
/etc/userns.conf for some reason is not loaded.  Now when hallyn does
ls /home/hallyn/containers/vs1, he sees files owned by (0,0), with
only the 'other' permissions.  Now he can't make them setuid root so
it's no vulnerability.  Just a wart.

Is that what you had in mind?

But I'll still look at doing capabilities first like you were
saying.

thanks,
-serge

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [0/10] User namespaces: introduction
       [not found]                 ` <20080825195124.GA9361-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-08-29  9:40                   ` Eric W. Biederman
  0 siblings, 0 replies; 24+ messages in thread
From: Eric W. Biederman @ 2008-08-29  9:40 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>> 
>> > It definately seems to make sense in terms of the security
>> > implications.  And solving this before the filesystem handlers seems
>> > to make sense too.  Although I would like to get the first 3 patches
> upstream
>> > pretty soon, as I believe they are proper fixes.
>> 
>> Reasonable.  I'm not certain about free_user continuing to be an inline
>> function as it seems a bit non-trivial, but otherwise that sounds correct.
>
> Great, I'll fix that and resend and ask for inclusion.
>
> So based on your input, here is how I'm seeing the next iteration of
> usernamespace-filesystem interaction semantics:
>
> if
> 	0=init_user_ns
> 	(X,Y) = (userns X, uid Y)
> 	(0,500) creates (1,0) and (1,1000)
> 	(1,1000) creates a file /foo/bar
> 	then
> 		inode->i_uid = 1000
> 		inode->i_userns = 1  (we use the mount-provided userns, right?)
> 			i_userns storing is per-fs, but probably uses xattr)

However we get that information.  Caching it per inode looks like the right
way to go.

> 		the fs stores the fact that (0,500) owns userns 1
> 			this might be stored just in /etc/userns.conf,
> 			and parsed at mount time)

> 	 when (1,1001) looks up /foo/bar, he sees owner=1000
> 	 when (0,501) looks up /foo/bar, he sees owner=500
> 	 when (0,501) creates (2,0) and (2,0) looksup /foo/bar,
> 		he sees owner=0, mode bits clear except the 'other' bits

Sounds right.  Either that or we reserve a different uid for the purpose.
Still 0 appears to the be the traditional owner of unknown users in 
tarballs so it should work as long as don't accidentally grant privilege.

> Put user_ns in struct inode so simple userid mapping can be done
> in generic code.

Yep.

> Here is a weirdness:  If (0,500) creates some files as (1,1000)
> under /home/hallyn/containers/vs1.  Now the system is rebooted, and the
> /etc/userns.conf for some reason is not loaded.  Now when hallyn does
> ls /home/hallyn/containers/vs1, he sees files owned by (0,0), with
> only the 'other' permissions.  Now he can't make them setuid root so
> it's no vulnerability.  Just a wart.

Yep.

> Is that what you had in mind?

Yep.  That sounds right.  I think we have a little ways to go before we
have the persistence side of things liked but the in kernel side of the
semantics sounds correct.

> But I'll still look at doing capabilities first like you were
> saying.

Thanks.  That looks like a lot fewer conflicts.

Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2008-08-29  9:40 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-08-22 19:45 [0/10] User namespaces: introduction Serge E. Hallyn
     [not found] ` <20080822194513.GA10262-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-22 19:45   ` [PATCH 01/10] user namespaces: introduce user_struct->user_namespace relationship Serge E. Hallyn
2008-08-22 19:45   ` [PATCH 02/10] user namespaces: move user_ns from nsproxy into user struct Serge E. Hallyn
2008-08-22 19:45   ` [PATCH 03/10] user namespaces: reset task's credentials on CLONE_NEWUSER Serge E. Hallyn
2008-08-22 19:46   ` [PATCH 04/10] user namespaces: enforce user namespaces for file permission Serge E. Hallyn
     [not found]     ` <20080822194609.GD10360-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-22 20:13       ` Eric W. Biederman
     [not found]         ` <m1ej4glsen.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-08-23  0:57           ` Serge E. Hallyn
     [not found]             ` <20080823005715.GB21064-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-23  2:16               ` Eric W. Biederman
2008-08-22 21:13       ` Eric W. Biederman
     [not found]         ` <m1bpzkhhy0.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-08-23  0:53           ` [PATCH 04/10] user namespaces: enforce usernamespaces " Serge E. Hallyn
     [not found]             ` <20080823005304.GA21064-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-23  1:56               ` Eric W. Biederman
     [not found]                 ` <m1r68gebop.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-08-23  2:22                   ` Serge E. Hallyn
     [not found]                     ` <20080823022210.GA29618-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-23  3:41                       ` Eric W. Biederman
2008-08-22 19:46   ` [PATCH 05/10] user namespaces: Allow registering new usernamespaces using mount Serge E. Hallyn
2008-08-22 19:46   ` [PATCH 06/10] user namespaces: hook fs/attr.c Serge E. Hallyn
2008-08-22 19:46   ` [PATCH 07/10] user namespaces: bad bad bad but test code Serge E. Hallyn
2008-08-22 19:47   ` [PATCH 08/10] userns: store child userns uids as xattrs in ext3 using lib/fsuserns Serge E. Hallyn
2008-08-22 19:47   ` [PATCH 09/10] userns: have ext3 use fsuserns to read userns xattrs, and add groups to userns Serge E. Hallyn
2008-08-22 19:47   ` [PATCH 10/10] userns: add support for readdir Serge E. Hallyn
2008-08-22 20:41   ` [0/10] User namespaces: introduction Eric W. Biederman
     [not found]     ` <m1d4k0ixzp.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-08-23  1:17       ` Serge E. Hallyn
     [not found]         ` <20080823011731.GA22737-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-23  3:19           ` Eric W. Biederman
     [not found]             ` <m1sksw770k.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-08-25 19:51               ` Serge E. Hallyn
     [not found]                 ` <20080825195124.GA9361-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-29  9:40                   ` Eric W. Biederman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.