[PATCH -mm 0/7] execns syscall and user namespace

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH -mm 0/7] execns syscall and user namespace
@ 2006-07-11  7:50 Cedric Le Goater
  2006-07-11  7:50 ` [PATCH -mm 1/7] add execns syscall core routine Cedric Le Goater
                   ` (9 more replies)
  0 siblings, 10 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11  7:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Eric W. Biederman, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Hello !

The following patchset adds the user namespace and a new syscall execns.

The user namespace will allow a process to unshare its user_struct table,
resetting at the same time its own user_struct and all the associated
accounting.

The purpose of execns is to make sure that a process unsharing a namespace
is free from any reference in the previous namespace. the execve() semantic
seems to be the best candidate as it already flushes the previous process
context.

Thanks for reviewing, sharing, flaming !

C.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH -mm 1/7] add execns syscall core routine
  2006-07-11  7:50 [PATCH -mm 0/7] execns syscall and user namespace Cedric Le Goater
@ 2006-07-11  7:50 ` Cedric Le Goater
  2006-07-11  7:50 ` [PATCH -mm 2/7] add execns syscall to s390 Cedric Le Goater
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11  7:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Cedric Le Goater, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

[-- Attachment #1: execns-syscall-core.patch --]
[-- Type: text/plain, Size: 6623 bytes --]

This patch adds the execns syscall core routine.

This new syscall is very similar to execve(). It takes an extra
CLONE_* flag argument which defines which namespaces are unshared
during the execve() call.

The purpose of such a syscall is to make sure that a process unsharing
a namespace is free from any reference in the previous namespace. the
execve() semantic seems to be the best candidate as it already flushes
the previous process context.

The purpose of flush_all_old_files() is to close *all* files even the
files without the close-on-exec flag. To be done. 

sample user program : 

#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

#include <linux/unistd.h>

#ifndef __NR_execns
#if __i386__
#    define __NR_execns 318
#elif __x86_64__
#    define __NR_execns 280
#elif __s390x__
#    define __NR_execns 310
#else
#    error "Architecture not supported"
#endif
#endif

static inline _syscall4(int,execns,int,flags,const char *,file,char **,argv,char **,envp)

#ifndef CLONE_NEWIPC
#define CLONE_NEWIPC	0x08000000
#endif

#ifndef CLONE_NEWUSER
#define CLONE_NEWUSER	0x10000000
#endif

static void usage(const char *name)
{
    printf("usage: %s [-iu] <command>\n", name);
    printf("\t-i : unshare ipc namespace.\n");
    printf("\t-u : unshare user namespace.\n");
    printf("\n");
    printf("(C) Copyright IBM Corp, 2006\n");
    printf("\n");
    exit(1);
}

int main(int argc, char* argv[])
{
    int flags = 0;
    int c;

    while ((c = getopt(argc, argv, "+iuh")) != EOF) {
	switch (c) {
	case 'i': flags |= CLONE_NEWIPC; break;
	case 'u': flags |= CLONE_NEWUSER; break;
	case 'h':
	default:
	    usage(argv[0]);
	}
    };
    
    argv = &argv[optind];
    argc = argc - optind;
	
    execns(flags, argv[0], argv, __environ);
    fprintf(stderr, "execns(%s) : %s\n", argv[0], strerror(errno));
    return 1;
}


Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>

---
 fs/exec.c                |   82 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h    |    1 
 include/linux/syscalls.h |    3 +
 kernel/sys_ni.c          |    2 +
 4 files changed, 88 insertions(+)

Index: 2.6.18-rc1-mm1/fs/exec.c
===================================================================
--- 2.6.18-rc1-mm1.orig/fs/exec.c
+++ 2.6.18-rc1-mm1/fs/exec.c
@@ -49,6 +49,7 @@
 #include <linux/acct.h>
 #include <linux/audit.h>
 #include <linux/notifier.h>
+#include <linux/user.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -805,6 +806,11 @@ static void flush_old_files(struct files
 	spin_unlock(&files->file_lock);
 }
 
+static void flush_all_old_files(struct files_struct * files)
+{
+	/* flush it all even close_on_exec == 0 */
+}
+
 void get_task_comm(char *buf, struct task_struct *tsk)
 {
 	/* buf must be at least sizeof(tsk->comm) in size */
@@ -1235,6 +1241,82 @@ out_ret:
 	return retval;
 }
 
+/*
+ * sys_execns() executes a new program and unshares selected
+ * namespaces.
+ */
+int do_execns(int unshare_flags, char * filename,
+	char __user *__user *argv,
+	char __user *__user *envp,
+	struct pt_regs * regs)
+{
+	int err = 0;
+	struct nsproxy *new_nsproxy = NULL, *old_nsproxy = NULL;
+	struct uts_namespace *uts, *new_uts = NULL;
+	struct ipc_namespace *ipc, *new_ipc = NULL;
+
+	err = unshare_utsname(unshare_flags, &new_uts);
+	if (err)
+		goto bad_execns_out;
+	err = unshare_ipcs(unshare_flags, &new_ipc);
+	if (err)
+		goto bad_execns_cleanup_uts;
+
+	if (new_uts || new_ipc) {
+		old_nsproxy = current->nsproxy;
+		new_nsproxy = dup_namespaces(old_nsproxy);
+		if (!new_nsproxy) {
+			err = -ENOMEM;
+			goto bad_execns_cleanup_ipc;
+		}
+	}
+
+	err = do_execve(filename, argv, envp, regs);
+	if (err)
+		goto bad_execns_cleanup_ipc;
+
+	/* make sure all files are flushed */
+	flush_all_old_files(current->files);
+
+	if (new_uts || new_ipc) {
+
+		task_lock(current);
+
+		if (new_nsproxy) {
+			current->nsproxy = new_nsproxy;
+			new_nsproxy = old_nsproxy;
+		}
+
+		if (new_uts) {
+			uts = current->nsproxy->uts_ns;
+			current->nsproxy->uts_ns = new_uts;
+			new_uts = uts;
+		}
+
+		if (new_ipc) {
+			ipc = current->nsproxy->ipc_ns;
+			current->nsproxy->ipc_ns = new_ipc;
+			new_ipc = ipc;
+		}
+
+		task_unlock(current);
+	}
+
+	if (new_nsproxy)
+		put_nsproxy(new_nsproxy);
+
+bad_execns_cleanup_ipc:
+	if (new_ipc)
+		put_ipc_ns(new_ipc);
+
+bad_execns_cleanup_uts:
+	if (new_uts)
+		put_uts_ns(new_uts);
+
+bad_execns_out:
+	return err;
+}
+
 int set_binfmt(struct linux_binfmt *new)
 {
 	struct linux_binfmt *old = current->binfmt;
Index: 2.6.18-rc1-mm1/include/linux/sched.h
===================================================================
--- 2.6.18-rc1-mm1.orig/include/linux/sched.h
+++ 2.6.18-rc1-mm1/include/linux/sched.h
@@ -1335,6 +1335,7 @@ extern int disallow_signal(int);
 extern struct task_struct *child_reaper;
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
+extern int do_execns(int, char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 
Index: 2.6.18-rc1-mm1/include/linux/syscalls.h
===================================================================
--- 2.6.18-rc1-mm1.orig/include/linux/syscalls.h
+++ 2.6.18-rc1-mm1/include/linux/syscalls.h
@@ -64,6 +64,7 @@ struct robust_list_head;
 #include <asm/signal.h>
 #include <linux/quota.h>
 #include <linux/key.h>
+#include <asm/ptrace.h>
 
 asmlinkage long sys_time(time_t __user *tloc);
 asmlinkage long sys_stime(time_t __user *tptr);
@@ -597,4 +598,6 @@ asmlinkage long sys_get_robust_list(int 
 asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
 				    size_t len);
 
+asmlinkage long sys_execns(int flags, char *name, char **argv, char **envp,
+			struct pt_regs regs);
 #endif
Index: 2.6.18-rc1-mm1/kernel/sys_ni.c
===================================================================
--- 2.6.18-rc1-mm1.orig/kernel/sys_ni.c
+++ 2.6.18-rc1-mm1/kernel/sys_ni.c
@@ -134,3 +134,5 @@ cond_syscall(sys_madvise);
 cond_syscall(sys_mremap);
 cond_syscall(sys_remap_file_pages);
 cond_syscall(compat_sys_move_pages);
+
+cond_syscall(sys_execns);

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH -mm 2/7] add execns syscall to s390
  2006-07-11  7:50 [PATCH -mm 0/7] execns syscall and user namespace Cedric Le Goater
  2006-07-11  7:50 ` [PATCH -mm 1/7] add execns syscall core routine Cedric Le Goater
@ 2006-07-11  7:50 ` Cedric Le Goater
  2006-07-11 13:44   ` Martin Schwidefsky
  2006-07-11 13:44   ` Martin Schwidefsky
  2006-07-11  7:50 ` [PATCH -mm 3/7] add execns syscall to x86_64 Cedric Le Goater
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11  7:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Cedric Le Goater, Heiko Carstens,
	Martin Schwidefsky, Kirill Korotaev, Andrey Savochkin,
	Eric W. Biederman, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

[-- Attachment #1: execns-syscall-s390.patch --]
[-- Type: text/plain, Size: 5342 bytes --]

This patch adds the execns() syscall to the s390 architecture.

The 32bits syscall is not implemented. 

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>

--
 arch/s390/kernel/entry.S    |   10 ++++++++++
 arch/s390/kernel/entry64.S  |    8 ++++++++
 arch/s390/kernel/process.c  |   31 +++++++++++++++++++++++++++++++
 arch/s390/kernel/syscalls.S |    1 +
 include/asm-s390/unistd.h   |    3 ++-
 5 files changed, 52 insertions(+), 1 deletion(-)

Index: 2.6.18-rc1-mm1/arch/s390/kernel/entry.S
===================================================================
--- 2.6.18-rc1-mm1.orig/arch/s390/kernel/entry.S
+++ 2.6.18-rc1-mm1/arch/s390/kernel/entry.S
@@ -410,6 +410,15 @@ sys_execve_glue:        
         bnz     0(%r12)               # it did fail -> store result in gpr2
         b       4(%r12)               # SKIP ST 2,SP_R2(15) after BASR 14,8
                                       # in system_call/sysc_tracesys
+sys_execns_glue:
+        la      %r2,SP_PTREGS(%r15)   # load pt_regs
+        l       %r1,BASED(.Lexecns)
+	lr      %r12,%r14             # save return address
+        basr    %r14,%r1              # call sys_execns
+        ltr     %r2,%r2               # check if execns failed
+        bnz     0(%r12)               # it did fail -> store result in gpr2
+        b       4(%r12)               # SKIP ST 2,SP_R2(15) after BASR 14,8
+                                      # in system_call/sysc_tracesys
 
 sys_sigreturn_glue:     
         la      %r2,SP_PTREGS(%r15)   # load pt_regs as parameter
@@ -1024,6 +1033,7 @@ cleanup_io_leave_insn:
 .Lschedule:    .long  schedule
 .Lclone:       .long  sys_clone
 .Lexecve:      .long  sys_execve
+.Lexecns:      .long  sys_execns
 .Lfork:        .long  sys_fork
 .Lrt_sigreturn:.long  sys_rt_sigreturn
 .Lrt_sigsuspend:
Index: 2.6.18-rc1-mm1/arch/s390/kernel/process.c
===================================================================
--- 2.6.18-rc1-mm1.orig/arch/s390/kernel/process.c
+++ 2.6.18-rc1-mm1/arch/s390/kernel/process.c
@@ -343,6 +343,37 @@ out:
         return error;
 }
 
+/*
+ * sys_execns() executes a new program and unshares selected
+ * namespaces.
+ */
+asmlinkage long sys_execns(struct pt_regs regs)
+{
+        int error;
+	int flags;
+        char * filename;
+
+	flags = regs.orig_gpr2;
+        filename = getname((char __user *) regs.gprs[3]);
+        error = PTR_ERR(filename);
+        if (IS_ERR(filename))
+                goto out;
+        error = do_execns(flags, filename,
+			  (char __user * __user *) regs.gprs[4],
+			  (char __user * __user *) regs.gprs[5], &regs);
+	if (error == 0) {
+		task_lock(current);
+		current->ptrace &= ~PT_DTRACE;
+		task_unlock(current);
+		current->thread.fp_regs.fpc = 0;
+		if (MACHINE_HAS_IEEE)
+			asm volatile("sfpc %0,%0" : : "d" (0));
+	}
+        putname(filename);
+out:
+        return error;
+}
+
 
 /*
  * fill in the FPU structure for a core dump.
Index: 2.6.18-rc1-mm1/arch/s390/kernel/syscalls.S
===================================================================
--- 2.6.18-rc1-mm1.orig/arch/s390/kernel/syscalls.S
+++ 2.6.18-rc1-mm1/arch/s390/kernel/syscalls.S
@@ -318,3 +318,4 @@ SYSCALL(sys_splice,sys_splice,sys_splice
 SYSCALL(sys_sync_file_range,sys_sync_file_range,sys_sync_file_range_wrapper)
 SYSCALL(sys_tee,sys_tee,sys_tee_wrapper)
 SYSCALL(sys_vmsplice,sys_vmsplice,compat_sys_vmsplice_wrapper)
+SYSCALL(sys_execns_glue,sys_execns_glue,sys_execns_glue)
Index: 2.6.18-rc1-mm1/arch/s390/kernel/entry64.S
===================================================================
--- 2.6.18-rc1-mm1.orig/arch/s390/kernel/entry64.S
+++ 2.6.18-rc1-mm1/arch/s390/kernel/entry64.S
@@ -403,6 +403,14 @@ sys_execve_glue:        
         bnz     0(%r12)               # it did fail -> store result in gpr2
         b       6(%r12)               # SKIP STG 2,SP_R2(15) in
                                       # system_call/sysc_tracesys
+sys_execns_glue:
+        la      %r2,SP_PTREGS(%r15)   # load pt_regs
+	lgr     %r12,%r14             # save return address
+        brasl   %r14,sys_execns       # call sys_execns
+        ltgr    %r2,%r2               # check if execns failed
+        bnz     0(%r12)               # it did fail -> store result in gpr2
+        b       6(%r12)               # SKIP STG 2,SP_R2(15) in
+                                      # system_call/sysc_tracesys
 #ifdef CONFIG_COMPAT
 sys32_execve_glue:        
         la      %r2,SP_PTREGS(%r15)   # load pt_regs
Index: 2.6.18-rc1-mm1/include/asm-s390/unistd.h
===================================================================
--- 2.6.18-rc1-mm1.orig/include/asm-s390/unistd.h
+++ 2.6.18-rc1-mm1/include/asm-s390/unistd.h
@@ -302,8 +302,9 @@
 #define __NR_sync_file_range	307
 #define __NR_tee		308
 #define __NR_vmsplice		309
+#define __NR_execns		310
 
-#define NR_syscalls 310
+#define NR_syscalls 311
 
 /* 
  * There are some system calls that are not present on 64 bit, some

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 2/7] add execns syscall to s390
  2006-07-11  7:50 ` [PATCH -mm 2/7] add execns syscall to s390 Cedric Le Goater
@ 2006-07-11 13:44   ` Martin Schwidefsky
  2006-07-11 13:44   ` Martin Schwidefsky
  1 sibling, 0 replies; 107+ messages in thread
From: Martin Schwidefsky @ 2006-07-11 13:44 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Heiko Carstens, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

On Tue, 2006-07-11 at 09:50 +0200, Cedric Le Goater wrote: 
> The 32bits syscall is not implemented. 

The attached patch implements compat_do_execns (untested).

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.
--

[patch] Add execns compat function.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com> 

--

--
 fs/compat.c            |   83 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/compat.h |    2 +
 2 files changed, 85 insertions(+)

diff -urpN linux-2.6/fs/compat.c linux-2.6-execns/fs/compat.c
--- linux-2.6/fs/compat.c	2006-07-11 15:37:46.000000000 +0200
+++ linux-2.6-execns/fs/compat.c	2006-07-11 15:35:59.000000000 +0200
@@ -46,6 +46,8 @@
 #include <linux/rwsem.h>
 #include <linux/acct.h>
 #include <linux/mm.h>
+#include <linux/ipc.h>
+#include <linux/utsname.h>
 
 #include <net/sock.h>		/* siocdevprivate_ioctl */
 
@@ -1584,6 +1586,87 @@ out_ret:
 	return retval;
 }
 
+#ifdef CONFIG_UTS_NS
+
+static void flush_all_old_files(struct files_struct * files)
+{
+       /* flush it all even close_on_exec == 0 */
+}
+
+int compat_do_execns(int unshare_flags, char * filename,
+		     compat_uptr_t __user *argv,
+		     compat_uptr_t __user *envp,
+		     struct pt_regs * regs)
+{
+       int err = 0;
+       struct nsproxy *new_nsproxy = NULL, *old_nsproxy = NULL;
+       struct uts_namespace *uts, *new_uts = NULL;
+       struct ipc_namespace *ipc, *new_ipc = NULL;
+
+       err = unshare_utsname(unshare_flags, &new_uts);
+       if (err)
+               goto bad_execns_out;
+       err = unshare_ipcs(unshare_flags, &new_ipc);
+       if (err)
+               goto bad_execns_cleanup_uts;
+
+       if (new_uts || new_ipc) {
+               old_nsproxy = current->nsproxy;
+               new_nsproxy = dup_namespaces(old_nsproxy);
+               if (!new_nsproxy) {
+                       err = -ENOMEM;
+                       goto bad_execns_cleanup_ipc;
+               }
+       }
+
+       err = compat_do_execve(filename, argv, envp, regs);
+       if (err)
+               goto bad_execns_cleanup_ipc;
+
+       /* make sure all files are flushed */
+       flush_all_old_files(current->files);
+
+       if (new_uts || new_ipc) {
+
+               task_lock(current);
+
+               if (new_nsproxy) {
+                       current->nsproxy = new_nsproxy;
+                       new_nsproxy = old_nsproxy;
+               }
+
+               if (new_uts) {
+                       uts = current->nsproxy->uts_ns;
+                       current->nsproxy->uts_ns = new_uts;
+                       new_uts = uts;
+               }
+
+               if (new_ipc) {
+                       ipc = current->nsproxy->ipc_ns;
+                       current->nsproxy->ipc_ns = new_ipc;
+                       new_ipc = ipc;
+               }
+
+               task_unlock(current);
+       }
+
+       if (new_nsproxy)
+               put_nsproxy(new_nsproxy);
+
+bad_execns_cleanup_ipc:
+       if (new_ipc)
+               put_ipc_ns(new_ipc);
+
+bad_execns_cleanup_uts:
+       if (new_uts)
+               put_uts_ns(new_uts);
+
+bad_execns_out:
+       return err;
+}
+
+#endif /* CONFIG_UTS_NS */
+
 #define __COMPAT_NFDBITS       (8 * sizeof(compat_ulong_t))
 
 #define ROUND_UP(x,y) (((x)+(y)-1)/(y))
diff -urpN linux-2.6/include/linux/compat.h linux-2.6-execns/include/linux/compat.h
--- linux-2.6/include/linux/compat.h	2006-07-11 15:37:46.000000000 +0200
+++ linux-2.6-execns/include/linux/compat.h	2006-07-11 14:54:56.000000000 +0200
@@ -185,6 +185,8 @@ asmlinkage ssize_t compat_sys_writev(uns
 
 int compat_do_execve(char * filename, compat_uptr_t __user *argv,
 	        compat_uptr_t __user *envp, struct pt_regs * regs);
+int compat_do_execns(int flags, char * filename, compat_uptr_t __user *argv,
+	        compat_uptr_t __user *envp, struct pt_regs * regs);
 
 asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
 		compat_ulong_t __user *outp, compat_ulong_t __user *exp,



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 2/7] add execns syscall to s390
  2006-07-11  7:50 ` [PATCH -mm 2/7] add execns syscall to s390 Cedric Le Goater
  2006-07-11 13:44   ` Martin Schwidefsky
@ 2006-07-11 13:44   ` Martin Schwidefsky
  2006-07-11 14:44     ` Cedric Le Goater
  1 sibling, 1 reply; 107+ messages in thread
From: Martin Schwidefsky @ 2006-07-11 13:44 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Heiko Carstens, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

On Tue, 2006-07-11 at 09:50 +0200, Cedric Le Goater wrote: 
> This patch adds the execns() syscall to the s390 architecture.

Fixed whitespace and added glue code for compat_do_execns.

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.

--

This patch adds the execns() syscall to the s390 architecture.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>

--- 
--
 arch/s390/kernel/compat_linux.c |   29 +++++++++++++++++++++++++++++
 arch/s390/kernel/entry.S        |   10 ++++++++++
 arch/s390/kernel/entry64.S      |   18 ++++++++++++++++++
 arch/s390/kernel/process.c      |   34 ++++++++++++++++++++++++++++++++++
 arch/s390/kernel/syscalls.S     |    1 +
 include/asm-s390/unistd.h       |    3 ++-
 kernel/sys_ni.c                 |    1 +
 7 files changed, 95 insertions(+), 1 deletion(-)

diff -urpN linux-2.6/arch/s390/kernel/compat_linux.c linux-2.6-execns/arch/s390/kernel/compat_linux.c
--- linux-2.6/arch/s390/kernel/compat_linux.c	2006-07-11 14:06:59.000000000 +0200
+++ linux-2.6-execns/arch/s390/kernel/compat_linux.c	2006-07-11 15:21:20.000000000 +0200
@@ -551,6 +551,35 @@ out:
         return error;
 }
 
+#ifdef CONFIG_UTS_NS
+
+asmlinkage long sys32_execns(struct pt_regs regs)
+{
+	int error;
+	int flags;
+	char * filename;
+
+	flags = regs.orig_gpr2;
+	filename = getname(compat_ptr(regs.gprs[3]));
+	error = PTR_ERR(filename);
+	if (IS_ERR(filename))
+		goto out;
+	error = compat_do_execns(flags, filename, compat_ptr(regs.gprs[3]),
+				 compat_ptr(regs.gprs[4]), &regs);
+	if (error == 0) {
+		task_lock(current);
+		current->ptrace &= ~PT_DTRACE;
+		task_unlock(current);
+		current->thread.fp_regs.fpc = 0;
+		asm volatile("sfpc %0,%0" : : "d" (0));
+	}
+	putname(filename);
+out:
+	return error;
+}
+
+#endif /* CONFIG_UTS_NS */
+
 
 #ifdef CONFIG_MODULES
 
diff -urpN linux-2.6/arch/s390/kernel/entry64.S linux-2.6-execns/arch/s390/kernel/entry64.S
--- linux-2.6/arch/s390/kernel/entry64.S	2006-07-11 14:06:59.000000000 +0200
+++ linux-2.6-execns/arch/s390/kernel/entry64.S	2006-07-11 15:02:50.000000000 +0200
@@ -403,6 +403,15 @@ sys_execve_glue:        
         bnz     0(%r12)               # it did fail -> store result in gpr2
         b       6(%r12)               # SKIP STG 2,SP_R2(15) in
                                       # system_call/sysc_tracesys
+sys_execns_glue:
+	la	%r2,SP_PTREGS(%r15)   # load pt_regs
+	lgr	%r12,%r14	      # save return address
+	brasl	%r14,sys_execns       # call sys_execns
+	ltgr	%r2,%r2		      # check if execns failed
+	bnz	0(%r12)		      # it did fail -> store result in gpr2
+	b	6(%r12)		      # SKIP STG 2,SP_R2(15) in
+				      # system_call/sysc_tracesys
+
 #ifdef CONFIG_COMPAT
 sys32_execve_glue:        
         la      %r2,SP_PTREGS(%r15)   # load pt_regs
@@ -412,6 +421,15 @@ sys32_execve_glue:        
         bnz     0(%r12)               # it did fail -> store result in gpr2
         b       6(%r12)               # SKIP STG 2,SP_R2(15) in
                                       # system_call/sysc_tracesys
+
+sys32_execns_glue:
+	la	%r2,SP_PTREGS(%r15)   # load pt_regs
+	lgr	%r12,%r14	      # save return address
+	brasl	%r14,sys32_execns     # call sys_execns
+	ltgr	%r2,%r2		      # check if execns failed
+	bnz	0(%r12)		      # it did fail -> store result in gpr2
+	b	6(%r12)		      # SKIP STG 2,SP_R2(15) in
+				      # system_call/sysc_tracesys
 #endif
 
 sys_sigreturn_glue:     
diff -urpN linux-2.6/arch/s390/kernel/entry.S linux-2.6-execns/arch/s390/kernel/entry.S
--- linux-2.6/arch/s390/kernel/entry.S	2006-07-11 14:06:59.000000000 +0200
+++ linux-2.6-execns/arch/s390/kernel/entry.S	2006-07-11 15:02:50.000000000 +0200
@@ -410,6 +410,15 @@ sys_execve_glue:        
         bnz     0(%r12)               # it did fail -> store result in gpr2
         b       4(%r12)               # SKIP ST 2,SP_R2(15) after BASR 14,8
                                       # in system_call/sysc_tracesys
+sys_execns_glue:
+	la	%r2,SP_PTREGS(%r15)   # load pt_regs
+	l	%r1,BASED(.Lexecns)
+	lr	%r12,%r14	      # save return address
+	basr	%r14,%r1	      # call sys_execns
+	ltr	%r2,%r2		      # check if execns failed
+	bnz	0(%r12)		      # it did fail -> store result in gpr2
+	b	4(%r12)		      # SKIP ST 2,SP_R2(15) after BASR 14,8
+                                      # in system_call/sysc_tracesys
 
 sys_sigreturn_glue:     
         la      %r2,SP_PTREGS(%r15)   # load pt_regs as parameter
@@ -1024,6 +1033,7 @@ cleanup_io_leave_insn:
 .Lschedule:    .long  schedule
 .Lclone:       .long  sys_clone
 .Lexecve:      .long  sys_execve
+.Lexecns:      .long  sys_execns
 .Lfork:        .long  sys_fork
 .Lrt_sigreturn:.long  sys_rt_sigreturn
 .Lrt_sigsuspend:
diff -urpN linux-2.6/arch/s390/kernel/process.c linux-2.6-execns/arch/s390/kernel/process.c
--- linux-2.6/arch/s390/kernel/process.c	2006-07-11 14:06:59.000000000 +0200
+++ linux-2.6-execns/arch/s390/kernel/process.c	2006-07-11 15:22:47.000000000 +0200
@@ -343,6 +343,40 @@ out:
         return error;
 }
 
+#ifdef CONFIG_UTS_NS
+
+/*
+ * sys_execns() executes a new program and unshares selected
+ * namespaces.
+ */
+asmlinkage long sys_execns(struct pt_regs regs)
+{
+	int error;
+	int flags;
+	char * filename;
+
+	flags = regs.orig_gpr2;
+	filename = getname((char __user *) regs.gprs[3]);
+	error = PTR_ERR(filename);
+	if (IS_ERR(filename))
+		goto out;
+	error = do_execns(flags, filename,
+			  (char __user * __user *) regs.gprs[4],
+			  (char __user * __user *) regs.gprs[5], &regs);
+	if (error == 0) {
+		task_lock(current);
+		current->ptrace &= ~PT_DTRACE;
+		task_unlock(current);
+		current->thread.fp_regs.fpc = 0;
+		if (MACHINE_HAS_IEEE)
+			asm volatile("sfpc %0,%0" : : "d" (0));
+	}
+	putname(filename);
+out:
+	return error;
+}
+
+#endif /* CONFIG_UTS_NS */
 
 /*
  * fill in the FPU structure for a core dump.
diff -urpN linux-2.6/arch/s390/kernel/syscalls.S linux-2.6-execns/arch/s390/kernel/syscalls.S
--- linux-2.6/arch/s390/kernel/syscalls.S	2006-06-18 03:49:35.000000000 +0200
+++ linux-2.6-execns/arch/s390/kernel/syscalls.S	2006-07-11 15:02:50.000000000 +0200
@@ -318,3 +318,4 @@ SYSCALL(sys_splice,sys_splice,sys_splice
 SYSCALL(sys_sync_file_range,sys_sync_file_range,sys_sync_file_range_wrapper)
 SYSCALL(sys_tee,sys_tee,sys_tee_wrapper)
 SYSCALL(sys_vmsplice,sys_vmsplice,compat_sys_vmsplice_wrapper)
+SYSCALL(sys_execns_glue,sys_execns_glue,sys32_execns_glue)
diff -urpN linux-2.6/include/asm-s390/unistd.h linux-2.6-execns/include/asm-s390/unistd.h
--- linux-2.6/include/asm-s390/unistd.h	2006-07-11 14:07:43.000000000 +0200
+++ linux-2.6-execns/include/asm-s390/unistd.h	2006-07-11 15:02:50.000000000 +0200
@@ -302,8 +302,9 @@
 #define __NR_sync_file_range	307
 #define __NR_tee		308
 #define __NR_vmsplice		309
+#define __NR_execns		310
 
-#define NR_syscalls 310
+#define NR_syscalls 311
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff -urpN linux-2.6/kernel/sys_ni.c linux-2.6-execns/kernel/sys_ni.c
--- linux-2.6/kernel/sys_ni.c	2006-07-11 14:11:23.000000000 +0200
+++ linux-2.6-execns/kernel/sys_ni.c	2006-07-11 15:21:44.000000000 +0200
@@ -121,6 +121,7 @@ cond_syscall(sys32_sysctl);
 cond_syscall(ppc_rtas);
 cond_syscall(sys_spu_run);
 cond_syscall(sys_spu_create);
+cond_syscall(sys32_execns);
 
 /* mmu depending weak syscall entries */
 cond_syscall(sys_mprotect);



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 2/7] add execns syscall to s390
  2006-07-11 13:44   ` Martin Schwidefsky
@ 2006-07-11 14:44     ` Cedric Le Goater
  2006-07-11 14:54       ` Martin Schwidefsky
  0 siblings, 1 reply; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11 14:44 UTC (permalink / raw)
  To: schwidefsky
  Cc: linux-kernel, Andrew Morton, Heiko Carstens, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

Martin Schwidefsky wrote:
> On Tue, 2006-07-11 at 09:50 +0200, Cedric Le Goater wrote: 
>> This patch adds the execns() syscall to the s390 architecture.
> 
> Fixed whitespace and added glue code for compat_do_execns.
> 

Thanks !

Both patches are in my patchset now which compiles and boots fine on s390x.
I'll build 32 bits binaries to try them.

Why did you protect sys32_execns, sys_execns and compat_do_execns () with
#ifdef CONFIG_UTS_NS ? Did you need to ?

C?

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 2/7] add execns syscall to s390
  2006-07-11 14:44     ` Cedric Le Goater
@ 2006-07-11 14:54       ` Martin Schwidefsky
  2006-07-11 15:43         ` Cedric Le Goater
  0 siblings, 1 reply; 107+ messages in thread
From: Martin Schwidefsky @ 2006-07-11 14:54 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Heiko Carstens, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

On Tue, 2006-07-11 at 16:44 +0200, Cedric Le Goater wrote:
> Thanks !
> 
> Both patches are in my patchset now which compiles and boots fine on s390x.
> I'll build 32 bits binaries to try them.
> 
> Why did you protect sys32_execns, sys_execns and compat_do_execns () with
> #ifdef CONFIG_UTS_NS ? Did you need to ?

Yes, without the #ifdefs I get compile time warnings.

-- 
blue skies,
  Martin.

Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH

"Reality continues to ruin my life." - Calvin.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 2/7] add execns syscall to s390
  2006-07-11 14:54       ` Martin Schwidefsky
@ 2006-07-11 15:43         ` Cedric Le Goater
  0 siblings, 0 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11 15:43 UTC (permalink / raw)
  To: schwidefsky
  Cc: linux-kernel, Andrew Morton, Heiko Carstens, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

Martin Schwidefsky wrote:
> On Tue, 2006-07-11 at 16:44 +0200, Cedric Le Goater wrote:
>> Thanks !
>>
>> Both patches are in my patchset now which compiles and boots fine on s390x.
>> I'll build 32 bits binaries to try them.
>>
>> Why did you protect sys32_execns, sys_execns and compat_do_execns () with
>> #ifdef CONFIG_UTS_NS ? Did you need to ?
> 
> Yes, without the #ifdefs I get compile time warnings.


fs/exec.c: In function `do_execns':
fs/exec.c:1262: warning: implicit declaration of function `unshare_ipcs'
fs/compat.c: In function `compat_do_execns':
fs/compat.c:1608: warning: implicit declaration of function `unshare_ipcs'

I think the issue comes from unshare_ipcs which needs a dummy version when
the namespace config options are not set. Here's a fix.

thanks,

C.

From: Cedric Le Goater <clg@fr.ibm.com>
Subject: export dummy unshare_ipcs when CONFIG_IPC_NS is not set

This patch exports a dummy unshare_ipcs when CONFIG_IPC_NS is not set.

This is needed by do_execns in fs/exec.c which also uses this
routine.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>

---
 include/linux/ipc.h |    3 ++-
 ipc/util.c          |    9 ++++++++-
 kernel/fork.c       |   10 ----------
 3 files changed, 10 insertions(+), 12 deletions(-)

Index: 2.6.18-rc1-mm1/include/linux/ipc.h
===================================================================
--- 2.6.18-rc1-mm1.orig/include/linux/ipc.h
+++ 2.6.18-rc1-mm1/include/linux/ipc.h
@@ -95,10 +95,11 @@ extern struct ipc_namespace init_ipc_ns;
 #define INIT_IPC_NS(ns)
 #endif

+extern int unshare_ipcs(unsigned long flags, struct ipc_namespace **ns);
+
 #ifdef CONFIG_IPC_NS
 extern void free_ipc_ns(struct kref *kref);
 extern int copy_ipcs(unsigned long flags, struct task_struct *tsk);
-extern int unshare_ipcs(unsigned long flags, struct ipc_namespace **ns);
 #else
 static inline int copy_ipcs(unsigned long flags, struct task_struct *tsk)
 {
Index: 2.6.18-rc1-mm1/kernel/fork.c
===================================================================
--- 2.6.18-rc1-mm1.orig/kernel/fork.c
+++ 2.6.18-rc1-mm1/kernel/fork.c
@@ -1575,16 +1575,6 @@ static int unshare_semundo(unsigned long
 	return 0;
 }

-#ifndef CONFIG_IPC_NS
-static inline int unshare_ipcs(unsigned long flags, struct ipc_namespace **ns)
-{
-	if (flags & CLONE_NEWIPC)
-		return -EINVAL;
-
-	return 0;
-}
-#endif
-
 /*
  * unshare allows a process to 'unshare' part of the process
  * context which was originally shared using clone.  copy_*
Index: 2.6.18-rc1-mm1/ipc/util.c
===================================================================
--- 2.6.18-rc1-mm1.orig/ipc/util.c
+++ 2.6.18-rc1-mm1/ipc/util.c
@@ -144,7 +144,14 @@ void free_ipc_ns(struct kref *kref)
 	shm_exit_ns(ns);
 	kfree(ns);
 }
-#endif
+#else
+int unshare_ipcs(unsigned long unshare_flags, struct ipc_namespace **new_ipc)
+{
+	if (unshare_flags & CLONE_NEWIPC)
+		return -EINVAL;
+	return 0;
+}
+#endif /* CONFIG_IPC_NS */

 /**
  *	ipc_init	-	initialise IPC subsystem



^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH -mm 3/7] add execns syscall to x86_64
  2006-07-11  7:50 [PATCH -mm 0/7] execns syscall and user namespace Cedric Le Goater
  2006-07-11  7:50 ` [PATCH -mm 1/7] add execns syscall core routine Cedric Le Goater
  2006-07-11  7:50 ` [PATCH -mm 2/7] add execns syscall to s390 Cedric Le Goater
@ 2006-07-11  7:50 ` Cedric Le Goater
  2006-07-11  7:50 ` [PATCH -mm 4/7] add execns syscall to i386 Cedric Le Goater
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11  7:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Cedric Le Goater, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

[-- Attachment #1: execns-syscall-x86_64.patch --]
[-- Type: text/plain, Size: 4164 bytes --]

This patch adds the execns() syscall to the x86_64 architecture.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>

---
 arch/x86_64/kernel/entry.S      |   30 ++++++++++++++++++++++++++++++
 arch/x86_64/kernel/functionlist |    3 +++
 arch/x86_64/kernel/process.c    |   25 +++++++++++++++++++++++++
 include/asm-x86_64/unistd.h     |    4 +++-
 4 files changed, 61 insertions(+), 1 deletion(-)

Index: 2.6.18-rc1-mm1/arch/x86_64/kernel/entry.S
===================================================================
--- 2.6.18-rc1-mm1.orig/arch/x86_64/kernel/entry.S
+++ 2.6.18-rc1-mm1/arch/x86_64/kernel/entry.S
@@ -453,6 +453,21 @@ ENTRY(stub_execve)
 	CFI_ENDPROC
 END(stub_execve)
 	
+ENTRY(stub_execns)
+	CFI_STARTPROC
+	popq %r11
+	CFI_ADJUST_CFA_OFFSET -8
+	CFI_REGISTER rip, r11
+	SAVE_REST
+	FIXUP_TOP_OF_STACK %r11
+	call sys_execns
+	RESTORE_TOP_OF_STACK %r11
+	movq %rax,RAX(%rsp)
+	RESTORE_REST
+	jmp int_ret_from_sys_call
+	CFI_ENDPROC
+END(stub_execns)
+
 /*
  * sigreturn is special because it needs to restore all registers on return.
  * This cannot be done with SYSRET, so use the IRET return path instead.
@@ -1014,6 +1029,21 @@ ENTRY(execve)
 	CFI_ENDPROC
 ENDPROC(execve)
 
+ENTRY(execns)
+	CFI_STARTPROC
+	FAKE_STACK_FRAME $0
+	SAVE_ALL
+	call sys_execns
+	movq %rax, RAX(%rsp)
+	RESTORE_REST
+	testq %rax,%rax
+	je int_ret_from_sys_call
+	RESTORE_ARGS
+	UNFAKE_STACK_FRAME
+	ret
+	CFI_ENDPROC
+ENDPROC(execns)
+
 KPROBE_ENTRY(page_fault)
 	errorentry do_page_fault
 END(page_fault)
Index: 2.6.18-rc1-mm1/arch/x86_64/kernel/functionlist
===================================================================
--- 2.6.18-rc1-mm1.orig/arch/x86_64/kernel/functionlist
+++ 2.6.18-rc1-mm1/arch/x86_64/kernel/functionlist
@@ -551,6 +551,7 @@
 *(.text.sys_getdents)
 *(.text.sys_dup)
 *(.text.stub_execve)
+*(.text.stub_execns)
 *(.text.sha_transform)
 *(.text.radix_tree_tag_clear)
 *(.text.put_unused_fd)
@@ -678,6 +679,7 @@
 *(.text.__find_symbol)
 *(.text.do_futex)
 *(.text.do_execve)
+*(.text.do_execns)
 *(.text.dirty_writeback_centisecs_handler)
 *(.text.dev_watchdog)
 *(.text.can_share_swap_page)
@@ -1098,6 +1100,7 @@
 *(.text.sys_fstat)
 *(.text.sysfs_readdir)
 *(.text.sys_execve)
+*(.text.sys_execns)
 *(.text.sysenter_tracesys)
 *(.text.sys_chown)
 *(.text.stub_clone)
Index: 2.6.18-rc1-mm1/arch/x86_64/kernel/process.c
===================================================================
--- 2.6.18-rc1-mm1.orig/arch/x86_64/kernel/process.c
+++ 2.6.18-rc1-mm1/arch/x86_64/kernel/process.c
@@ -697,6 +697,31 @@ long sys_execve(char __user *name, char 
 	return error;
 }
 
+/*
+ * sys_execns() executes a new program and unshares selected
+ * namespaces.
+ */
+asmlinkage
+long sys_execns(int flags, char __user *name, char __user * __user *argv,
+		char __user * __user *envp, struct pt_regs regs)
+{
+	long error;
+	char * filename;
+
+	filename = getname(name);
+	error = PTR_ERR(filename);
+	if (IS_ERR(filename))
+		return error;
+	error = do_execns(flags, filename, argv, envp, &regs);
+	if (error == 0) {
+		task_lock(current);
+		current->ptrace &= ~PT_DTRACE;
+		task_unlock(current);
+	}
+	putname(filename);
+	return error;
+}
+
 void set_personality_64bit(void)
 {
 	/* inherit personality from parent */
Index: 2.6.18-rc1-mm1/include/asm-x86_64/unistd.h
===================================================================
--- 2.6.18-rc1-mm1.orig/include/asm-x86_64/unistd.h
+++ 2.6.18-rc1-mm1/include/asm-x86_64/unistd.h
@@ -619,10 +619,12 @@ __SYSCALL(__NR_sync_file_range, sys_sync
 __SYSCALL(__NR_vmsplice, sys_vmsplice)
 #define __NR_move_pages		279
 __SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_execns		280
+__SYSCALL(__NR_execns, stub_execns)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_execns
 #include <linux/err.h>
 
 #ifndef __NO_STUBS

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH -mm 4/7] add execns syscall to i386
  2006-07-11  7:50 [PATCH -mm 0/7] execns syscall and user namespace Cedric Le Goater
                   ` (2 preceding siblings ...)
  2006-07-11  7:50 ` [PATCH -mm 3/7] add execns syscall to x86_64 Cedric Le Goater
@ 2006-07-11  7:50 ` Cedric Le Goater
  2006-07-11  7:50 ` [PATCH -mm 5/7] add user namespace Cedric Le Goater
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11  7:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Cedric Le Goater, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

[-- Attachment #1: execns-syscall-i386.patch --]
[-- Type: text/plain, Size: 2542 bytes --]

This patch adds the execns() syscall to the i386 architecture.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>

---
 arch/i386/kernel/process.c       |   31 +++++++++++++++++++++++++++++++
 arch/i386/kernel/syscall_table.S |    1 +
 include/asm-i386/unistd.h        |    3 ++-
 3 files changed, 34 insertions(+), 1 deletion(-)

Index: 2.6.18-rc1-mm1/arch/i386/kernel/process.c
===================================================================
--- 2.6.18-rc1-mm1.orig/arch/i386/kernel/process.c
+++ 2.6.18-rc1-mm1/arch/i386/kernel/process.c
@@ -768,6 +768,37 @@ out:
 	return error;
 }
 
+/*
+ * sys_execns() executes a new program and unshares selected
+ * namespaces.
+ */
+asmlinkage int sys_execns(struct pt_regs regs)
+{
+	int error;
+	int flags;
+	char * filename;
+
+	flags = regs.ebx;
+	filename = getname((char __user *) regs.ecx);
+	error = PTR_ERR(filename);
+	if (IS_ERR(filename))
+		goto out;
+	error = do_execns(flags, filename,
+			(char __user * __user *) regs.edx,
+			(char __user * __user *) regs.edi,
+			&regs);
+	if (error == 0) {
+		task_lock(current);
+		current->ptrace &= ~PT_DTRACE;
+		task_unlock(current);
+		/* Make sure we don't return using sysenter.. */
+		set_thread_flag(TIF_IRET);
+	}
+	putname(filename);
+out:
+	return error;
+}
+
 #define top_esp                (THREAD_SIZE - sizeof(unsigned long))
 #define top_ebp                (THREAD_SIZE - 2*sizeof(unsigned long))
 
Index: 2.6.18-rc1-mm1/arch/i386/kernel/syscall_table.S
===================================================================
--- 2.6.18-rc1-mm1.orig/arch/i386/kernel/syscall_table.S
+++ 2.6.18-rc1-mm1/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,4 @@ ENTRY(sys_call_table)
 	.long sys_tee			/* 315 */
 	.long sys_vmsplice
 	.long sys_move_pages
+	.long sys_execns
Index: 2.6.18-rc1-mm1/include/asm-i386/unistd.h
===================================================================
--- 2.6.18-rc1-mm1.orig/include/asm-i386/unistd.h
+++ 2.6.18-rc1-mm1/include/asm-i386/unistd.h
@@ -323,10 +323,11 @@
 #define __NR_tee		315
 #define __NR_vmsplice		316
 #define __NR_move_pages		317
+#define __NR_execns		318
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 318
+#define NR_syscalls 319
 #include <linux/err.h>
 
 /*

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH -mm 5/7] add user namespace
  2006-07-11  7:50 [PATCH -mm 0/7] execns syscall and user namespace Cedric Le Goater
                   ` (3 preceding siblings ...)
  2006-07-11  7:50 ` [PATCH -mm 4/7] add execns syscall to i386 Cedric Le Goater
@ 2006-07-11  7:50 ` Cedric Le Goater
  2006-07-11 16:39   ` Kirill Korotaev
  2006-07-12  3:46   ` Eric W. Biederman
  2006-07-11  7:50 ` [PATCH -mm 6/7] add the user namespace to the execns syscall Cedric Le Goater
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11  7:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Cedric Le Goater, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

[-- Attachment #1: user-namespace.patch --]
[-- Type: text/plain, Size: 14013 bytes --]

This patch adds the user namespace.

Basically, it allows a process to unshare its user_struct table,
resetting at the same time its own user_struct and all the associated
accounting.

For the moment, the root_user is added to the new user namespace when
it is cloned. An alternative behavior would be to let the system
allocate a new user_struct(0) in each new user namespace. However,
these 0 users would not have the privileges of the root_user and it
would be necessary to work on the process capabilities to give them
some permissions.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>

---
 fs/ioprio.c               |    5 +
 include/linux/init_task.h |    2 
 include/linux/nsproxy.h   |    2 
 include/linux/sched.h     |    6 +-
 include/linux/user.h      |   45 +++++++++++++++
 init/Kconfig              |    8 ++
 kernel/nsproxy.c          |   15 ++++-
 kernel/sys.c              |    8 +-
 kernel/user.c             |  135 ++++++++++++++++++++++++++++++++++++++++++----
 9 files changed, 208 insertions(+), 18 deletions(-)

Index: 2.6.18-rc1-mm1/kernel/user.c
===================================================================
--- 2.6.18-rc1-mm1.orig/kernel/user.c
+++ 2.6.18-rc1-mm1/kernel/user.c
@@ -14,20 +14,28 @@
 #include <linux/bitops.h>
 #include <linux/key.h>
 #include <linux/interrupt.h>
+#include <linux/user.h>
+#include <linux/module.h>
+#include <linux/nsproxy.h>
 
 /*
  * UID task count cache, to get fast user lookup in "alloc_uid"
  * when changing user ID's (ie setuid() and friends).
  */
 
-#define UIDHASH_BITS (CONFIG_BASE_SMALL ? 3 : 8)
-#define UIDHASH_SZ		(1 << UIDHASH_BITS)
 #define UIDHASH_MASK		(UIDHASH_SZ - 1)
 #define __uidhashfn(uid)	(((uid >> UIDHASH_BITS) + uid) & UIDHASH_MASK)
-#define uidhashentry(uid)	(uidhash_table + __uidhashfn((uid)))
+#define uidhashentry(ns, uid)	((ns)->uidhash_table + __uidhashfn((uid)))
 
 static kmem_cache_t *uid_cachep;
-static struct list_head uidhash_table[UIDHASH_SZ];
+
+struct user_namespace init_user_ns = {
+	.kref = {
+		.refcount	= ATOMIC_INIT(2),
+	},
+};
+
+EXPORT_SYMBOL_GPL(init_user_ns);
 
 /*
  * The uidhash_lock is mostly taken from process context, but it is
@@ -84,19 +92,126 @@ static inline struct user_struct *uid_ha
 	return NULL;
 }
 
+
+#ifdef CONFIG_USER_NS
+
+/*
+ * Clone a new ns copying an original user ns, setting refcount to 1
+ * @old_ns: namespace to clone
+ * Return NULL on error (failure to kmalloc), new ns otherwise
+ */
+static struct user_namespace *clone_user_ns(struct user_namespace *old_ns)
+{
+	struct user_namespace *ns;
+
+	ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL);
+	if (ns) {
+		int n;
+		struct user_struct *new_user;
+
+		kref_init(&ns->kref);
+
+		for(n = 0; n < UIDHASH_SZ; ++n)
+			INIT_LIST_HEAD(ns->uidhash_table + n);
+
+		/* Insert default root user. An alternate solution
+		 * would be to let the system allocate a new user 0
+		 * for this namespace and work on capabilities to give
+		 * the process some privileges.
+		 */
+		spin_lock_irq(&uidhash_lock);
+		uid_hash_insert(&root_user, uidhashentry(ns, 0));
+		spin_unlock_irq(&uidhash_lock);
+
+		/* Reset current->user with a new one */
+		new_user = alloc_uid(ns, current->uid);
+		if (!new_user) {
+			kfree(ns);
+			return NULL;
+		}
+
+		switch_uid(new_user);
+	}
+	return ns;
+}
+
+/*
+ * unshare the current process' user namespace.
+ */
+int unshare_user_ns(unsigned long unshare_flags,
+		    struct user_namespace **new_user)
+{
+	if (unshare_flags & CLONE_NEWUSER) {
+		if (!capable(CAP_SYS_ADMIN))
+			return -EPERM;
+
+		*new_user = clone_user_ns(current->nsproxy->user_ns);
+		if (!*new_user)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+/*
+ * Copy task tsk's user namespace, or clone it if flags specifies
+ * CLONE_NEWUSER. In latter case, changes to the user namespace of
+ * this process won't be seen by parent, and vice versa.
+ */
+int copy_user_ns(int flags, struct task_struct *tsk)
+{
+	struct user_namespace *old_ns = tsk->nsproxy->user_ns;
+	struct user_namespace *new_ns;
+	int err = 0;
+
+	if (!old_ns)
+		return 0;
+
+	get_user_ns(old_ns);
+
+	if (!(flags & CLONE_NEWUSER))
+		return 0;
+
+	if (!capable(CAP_SYS_ADMIN)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	new_ns = clone_user_ns(old_ns);
+	if (!new_ns) {
+		err = -ENOMEM;
+		goto out;
+	}
+	tsk->nsproxy->user_ns = new_ns;
+
+out:
+	put_user_ns(old_ns);
+	return err;
+}
+
+void free_user_ns(struct kref *kref)
+{
+	struct user_namespace *ns;
+
+	ns = container_of(kref, struct user_namespace, kref);
+	kfree(ns);
+}
+
+#endif /* CONFIG_USER_NS */
+
 /*
  * Locate the user_struct for the passed UID.  If found, take a ref on it.  The
  * caller must undo that ref with free_uid().
  *
  * If the user_struct could not be found, return NULL.
  */
-struct user_struct *find_user(uid_t uid)
+struct user_struct *find_user(struct user_namespace *ns, uid_t uid)
 {
 	struct user_struct *ret;
 	unsigned long flags;
 
 	spin_lock_irqsave(&uidhash_lock, flags);
-	ret = uid_hash_find(uid, uidhashentry(uid));
+	ret = uid_hash_find(uid, uidhashentry(ns, uid));
 	spin_unlock_irqrestore(&uidhash_lock, flags);
 	return ret;
 }
@@ -120,9 +235,9 @@ void free_uid(struct user_struct *up)
 	}
 }
 
-struct user_struct * alloc_uid(uid_t uid)
+struct user_struct * alloc_uid(struct user_namespace *ns, uid_t uid)
 {
-	struct list_head *hashent = uidhashentry(uid);
+	struct list_head *hashent = uidhashentry(ns, uid);
 	struct user_struct *up;
 
 	spin_lock_irq(&uidhash_lock);
@@ -200,11 +315,11 @@ static int __init uid_cache_init(void)
 			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
 
 	for(n = 0; n < UIDHASH_SZ; ++n)
-		INIT_LIST_HEAD(uidhash_table + n);
+		INIT_LIST_HEAD(init_user_ns.uidhash_table + n);
 
 	/* Insert the root user immediately (init already runs as root) */
 	spin_lock_irq(&uidhash_lock);
-	uid_hash_insert(&root_user, uidhashentry(0));
+	uid_hash_insert(&root_user, uidhashentry(&init_user_ns, 0));
 	spin_unlock_irq(&uidhash_lock);
 
 	return 0;
Index: 2.6.18-rc1-mm1/include/linux/nsproxy.h
===================================================================
--- 2.6.18-rc1-mm1.orig/include/linux/nsproxy.h
+++ 2.6.18-rc1-mm1/include/linux/nsproxy.h
@@ -7,6 +7,7 @@
 struct namespace;
 struct uts_namespace;
 struct ipc_namespace;
+struct user_namespace;
 
 /*
  * A structure to contain pointers to all per-process
@@ -25,6 +26,7 @@ struct nsproxy {
 	spinlock_t nslock;
 	struct uts_namespace *uts_ns;
 	struct ipc_namespace *ipc_ns;
+	struct user_namespace *user_ns;
 	struct namespace *namespace;
 };
 extern struct nsproxy init_nsproxy;
Index: 2.6.18-rc1-mm1/include/linux/user.h
===================================================================
--- 2.6.18-rc1-mm1.orig/include/linux/user.h
+++ 2.6.18-rc1-mm1/include/linux/user.h
@@ -1 +1,46 @@
+#ifndef _LINUX_USER_H
+#define _LINUX_USER_H
+
 #include <asm/user.h>
+
+#define UIDHASH_BITS (CONFIG_BASE_SMALL ? 3 : 8)
+#define UIDHASH_SZ		(1 << UIDHASH_BITS)
+
+struct user_namespace {
+	struct kref		kref;
+	struct list_head	uidhash_table[UIDHASH_SZ];
+};
+
+extern struct user_namespace init_user_ns;
+
+static inline void get_user_ns(struct user_namespace *ns)
+{
+	kref_get(&ns->kref);
+}
+
+#ifdef CONFIG_USER_NS
+extern int unshare_user_ns(unsigned long unshare_flags,
+			   struct user_namespace **new_user);
+extern int copy_user_ns(int flags, struct task_struct *tsk);
+extern void free_user_ns(struct kref *kref);
+
+static inline void put_user_ns(struct user_namespace *ns)
+{
+	kref_put(&ns->kref, free_user_ns);
+}
+#else
+static inline int unshare_user_ns(unsigned long unshare_flags,
+			struct user_namespace **new_user)
+{
+	return -EINVAL;
+}
+static inline int copy_user_ns(int flags, struct task_struct *tsk)
+{
+	return 0;
+}
+static inline void put_user_ns(struct user_namespace *ns)
+{
+}
+#endif /* CONFIG_USER_NS */
+
+#endif /* _LINUX_USER_H */
Index: 2.6.18-rc1-mm1/kernel/nsproxy.c
===================================================================
--- 2.6.18-rc1-mm1.orig/kernel/nsproxy.c
+++ 2.6.18-rc1-mm1/kernel/nsproxy.c
@@ -18,6 +18,7 @@
 #include <linux/nsproxy.h>
 #include <linux/namespace.h>
 #include <linux/utsname.h>
+#include <linux/user.h>
 
 static inline void get_nsproxy(struct nsproxy *ns)
 {
@@ -65,6 +66,8 @@ struct nsproxy *dup_namespaces(struct ns
 			get_uts_ns(ns->uts_ns);
 		if (ns->ipc_ns)
 			get_ipc_ns(ns->ipc_ns);
+		if (ns->user_ns)
+			get_user_ns(ns->user_ns);
 	}
 
 	return ns;
@@ -85,7 +88,8 @@ int copy_namespaces(int flags, struct ta
 
 	get_nsproxy(old_ns);
 
-	if (!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC)))
+	if (!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
+			CLONE_NEWUSER)))
 		return 0;
 
 	new_ns = clone_namespaces(old_ns);
@@ -108,10 +112,17 @@ int copy_namespaces(int flags, struct ta
 	if (err)
 		goto out_ipc;
 
+	err = copy_user_ns(flags, tsk);
+	if (err)
+		goto out_user;
+
 out:
 	put_nsproxy(old_ns);
 	return err;
 
+out_user:
+	if (new_ns->ipc_ns)
+		put_ipc_ns(new_ns->ipc_ns);
 out_ipc:
 	if (new_ns->uts_ns)
 		put_uts_ns(new_ns->uts_ns);
@@ -132,5 +143,7 @@ void free_nsproxy(struct nsproxy *ns)
 			put_uts_ns(ns->uts_ns);
 		if (ns->ipc_ns)
 			put_ipc_ns(ns->ipc_ns);
+		if (ns->user_ns)
+			put_user_ns(ns->user_ns);
 		kfree(ns);
 }
Index: 2.6.18-rc1-mm1/include/linux/sched.h
===================================================================
--- 2.6.18-rc1-mm1.orig/include/linux/sched.h
+++ 2.6.18-rc1-mm1/include/linux/sched.h
@@ -26,6 +26,7 @@
 #define CLONE_STOPPED		0x02000000	/* Start in stopped state */
 #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
 #define CLONE_NEWIPC		0x08000000	/* New ipcs */
+#define CLONE_NEWUSER		0x10000000	/* New user */
 
 /*
  * Scheduling policies
@@ -243,6 +244,7 @@ extern signed long schedule_timeout_unin
 asmlinkage void schedule(void);
 
 struct nsproxy;
+struct user_namespace;
 
 /* Maximum number of active map areas.. This is a random (large) number */
 #define DEFAULT_MAX_MAP_COUNT	65536
@@ -537,7 +539,7 @@ struct user_struct {
 	uid_t uid;
 };
 
-extern struct user_struct *find_user(uid_t);
+extern struct user_struct *find_user(struct user_namespace *, uid_t);
 
 extern struct user_struct root_user;
 #define INIT_USER (&root_user)
@@ -1204,7 +1206,7 @@ extern void set_special_pids(pid_t sessi
 extern void __set_special_pids(pid_t session, pid_t pgrp);
 
 /* per-UID process charging. */
-extern struct user_struct * alloc_uid(uid_t);
+extern struct user_struct * alloc_uid(struct user_namespace *, uid_t);
 static inline struct user_struct *get_uid(struct user_struct *u)
 {
 	atomic_inc(&u->__count);
Index: 2.6.18-rc1-mm1/init/Kconfig
===================================================================
--- 2.6.18-rc1-mm1.orig/init/Kconfig
+++ 2.6.18-rc1-mm1/init/Kconfig
@@ -237,6 +237,14 @@ config UTS_NS
 	  vservers, to use uts namespaces to provide different
 	  uts info for different servers.  If unsure, say N.
 
+config USER_NS
+	bool "User Namespaces"
+	default n
+	help
+	  Support user namespaces.  This allows containers, i.e.
+	  vservers, to use user namespaces to provide different
+	  user info for different servers.  If unsure, say N.
+
 config AUDIT
 	bool "Auditing support"
 	depends on NET
Index: 2.6.18-rc1-mm1/include/linux/init_task.h
===================================================================
--- 2.6.18-rc1-mm1.orig/include/linux/init_task.h
+++ 2.6.18-rc1-mm1/include/linux/init_task.h
@@ -8,6 +8,7 @@
 #include <linux/lockdep.h>
 #include <linux/notifier.h>
 #include <linux/ipc.h>
+#include <linux/user.h>
 
 #define INIT_FDTABLE \
 {							\
@@ -78,6 +79,7 @@ extern struct nsproxy init_nsproxy;
 	.uts_ns		= &init_uts_ns,					\
 	.namespace	= NULL,						\
 	INIT_IPC_NS(ipc_ns)						\
+	.user_ns	= &init_user_ns,				\
 }
 
 #define INIT_SIGHAND(sighand) {						\
Index: 2.6.18-rc1-mm1/fs/ioprio.c
===================================================================
--- 2.6.18-rc1-mm1.orig/fs/ioprio.c
+++ 2.6.18-rc1-mm1/fs/ioprio.c
@@ -25,6 +25,7 @@
 #include <linux/capability.h>
 #include <linux/syscalls.h>
 #include <linux/security.h>
+#include <linux/nsproxy.h>
 
 static int set_task_ioprio(struct task_struct *task, int ioprio)
 {
@@ -101,7 +102,7 @@ asmlinkage long sys_ioprio_set(int which
 			if (!who)
 				user = current->user;
 			else
-				user = find_user(who);
+				user = find_user(current->nsproxy->user_ns, who);
 
 			if (!user)
 				break;
@@ -171,7 +172,7 @@ asmlinkage long sys_ioprio_get(int which
 			if (!who)
 				user = current->user;
 			else
-				user = find_user(who);
+				user = find_user(current->nsproxy->user_ns, who);
 
 			if (!user)
 				break;
Index: 2.6.18-rc1-mm1/kernel/sys.c
===================================================================
--- 2.6.18-rc1-mm1.orig/kernel/sys.c
+++ 2.6.18-rc1-mm1/kernel/sys.c
@@ -544,7 +544,8 @@ asmlinkage long sys_setpriority(int whic
 			if (!who)
 				who = current->uid;
 			else
-				if ((who != current->uid) && !(user = find_user(who)))
+				if ((who != current->uid) &&
+					!(user = find_user(current->nsproxy->user_ns, who)))
 					goto out_unlock;	/* No processes for this user */
 
 			do_each_thread(g, p)
@@ -602,7 +603,8 @@ asmlinkage long sys_getpriority(int whic
 			if (!who)
 				who = current->uid;
 			else
-				if ((who != current->uid) && !(user = find_user(who)))
+				if ((who != current->uid) &&
+					!(user = find_user(current->nsproxy->user_ns, who)))
 					goto out_unlock;	/* No processes for this user */
 
 			do_each_thread(g, p)
@@ -935,7 +937,7 @@ static int set_user(uid_t new_ruid, int 
 {
 	struct user_struct *new_user;
 
-	new_user = alloc_uid(new_ruid);
+	new_user = alloc_uid(current->nsproxy->user_ns, new_ruid);
 	if (!new_user)
 		return -EAGAIN;
 

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-11  7:50 ` [PATCH -mm 5/7] add user namespace Cedric Le Goater
@ 2006-07-11 16:39   ` Kirill Korotaev
  2006-07-11 17:38     ` Cedric Le Goater
  2006-07-12  3:33     ` Eric W. Biederman
  2006-07-12  3:46   ` Eric W. Biederman
  1 sibling, 2 replies; 107+ messages in thread
From: Kirill Korotaev @ 2006-07-11 16:39 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Eric W. Biederman, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

I wonder about another namespace coupling I found thinking
about your patches.

Lets take a look at sys_setpriority() or any other function calling find_user():
it can change the priority for all user or group processes like:

do_each_thread_ve(g, p) {
    if (p->uid == who)
        error = set_one_prio(p, niceval, error);
} while_each_thread_ve(g, p);

which essentially means that user-namespace becomes coupled with
process-namespace. Sure, we can check in every such place for
  p->nsproxy->user_ns == current->nsproxy->user_ns
condition. But this a way IMHO leading to kernel full of security
crap which is hardly maintainable.

Another example of not so evident coupling here:
user structure maintains number of processes/opened files/sigpending/locked_shm etc.
if a single user can belong to different proccess/ipc/... namespaces
all these becomes unusable.

Small patch comment:
- what is the reason in adding 2nd arg to find_user()?

Thanks,
Kirill

> This patch adds the user namespace.
> 
> Basically, it allows a process to unshare its user_struct table,
> resetting at the same time its own user_struct and all the associated
> accounting.
> 
> For the moment, the root_user is added to the new user namespace when
> it is cloned. An alternative behavior would be to let the system
> allocate a new user_struct(0) in each new user namespace. However,
> these 0 users would not have the privileges of the root_user and it
> would be necessary to work on the process capabilities to give them
> some permissions.
> 
> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
> Cc: Andrew Morton <akpm@osdl.org>
> Cc: Kirill Korotaev <dev@openvz.org>
> Cc: Andrey Savochkin <saw@sw.ru>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: Herbert Poetzl <herbert@13thfloor.at>
> Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
> Cc: Serge E. Hallyn <serue@us.ibm.com>
> Cc: Dave Hansen <haveblue@us.ibm.com>
> 
> ---
>  fs/ioprio.c               |    5 +
>  include/linux/init_task.h |    2 
>  include/linux/nsproxy.h   |    2 
>  include/linux/sched.h     |    6 +-
>  include/linux/user.h      |   45 +++++++++++++++
>  init/Kconfig              |    8 ++
>  kernel/nsproxy.c          |   15 ++++-
>  kernel/sys.c              |    8 +-
>  kernel/user.c             |  135 ++++++++++++++++++++++++++++++++++++++++++----
>  9 files changed, 208 insertions(+), 18 deletions(-)
> 
> Index: 2.6.18-rc1-mm1/kernel/user.c
> ===================================================================
> --- 2.6.18-rc1-mm1.orig/kernel/user.c
> +++ 2.6.18-rc1-mm1/kernel/user.c
> @@ -14,20 +14,28 @@
>  #include <linux/bitops.h>
>  #include <linux/key.h>
>  #include <linux/interrupt.h>
> +#include <linux/user.h>
> +#include <linux/module.h>
> +#include <linux/nsproxy.h>
>  
>  /*
>   * UID task count cache, to get fast user lookup in "alloc_uid"
>   * when changing user ID's (ie setuid() and friends).
>   */
>  
> -#define UIDHASH_BITS (CONFIG_BASE_SMALL ? 3 : 8)
> -#define UIDHASH_SZ		(1 << UIDHASH_BITS)
>  #define UIDHASH_MASK		(UIDHASH_SZ - 1)
>  #define __uidhashfn(uid)	(((uid >> UIDHASH_BITS) + uid) & UIDHASH_MASK)
> -#define uidhashentry(uid)	(uidhash_table + __uidhashfn((uid)))
> +#define uidhashentry(ns, uid)	((ns)->uidhash_table + __uidhashfn((uid)))
>  
>  static kmem_cache_t *uid_cachep;
> -static struct list_head uidhash_table[UIDHASH_SZ];
> +
> +struct user_namespace init_user_ns = {
> +	.kref = {
> +		.refcount	= ATOMIC_INIT(2),
> +	},
> +};
> +
> +EXPORT_SYMBOL_GPL(init_user_ns);
>  
>  /*
>   * The uidhash_lock is mostly taken from process context, but it is
> @@ -84,19 +92,126 @@ static inline struct user_struct *uid_ha
>  	return NULL;
>  }
>  
> +
> +#ifdef CONFIG_USER_NS
> +
> +/*
> + * Clone a new ns copying an original user ns, setting refcount to 1
> + * @old_ns: namespace to clone
> + * Return NULL on error (failure to kmalloc), new ns otherwise
> + */
> +static struct user_namespace *clone_user_ns(struct user_namespace *old_ns)
> +{
> +	struct user_namespace *ns;
> +
> +	ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL);
> +	if (ns) {
> +		int n;
> +		struct user_struct *new_user;
> +
> +		kref_init(&ns->kref);
> +
> +		for(n = 0; n < UIDHASH_SZ; ++n)
> +			INIT_LIST_HEAD(ns->uidhash_table + n);
> +
> +		/* Insert default root user. An alternate solution
> +		 * would be to let the system allocate a new user 0
> +		 * for this namespace and work on capabilities to give
> +		 * the process some privileges.
> +		 */
> +		spin_lock_irq(&uidhash_lock);
> +		uid_hash_insert(&root_user, uidhashentry(ns, 0));
> +		spin_unlock_irq(&uidhash_lock);
> +
> +		/* Reset current->user with a new one */
> +		new_user = alloc_uid(ns, current->uid);
> +		if (!new_user) {
> +			kfree(ns);
> +			return NULL;
> +		}
> +
> +		switch_uid(new_user);
> +	}
> +	return ns;
> +}
> +
> +/*
> + * unshare the current process' user namespace.
> + */
> +int unshare_user_ns(unsigned long unshare_flags,
> +		    struct user_namespace **new_user)
> +{
> +	if (unshare_flags & CLONE_NEWUSER) {
> +		if (!capable(CAP_SYS_ADMIN))
> +			return -EPERM;
> +
> +		*new_user = clone_user_ns(current->nsproxy->user_ns);
> +		if (!*new_user)
> +			return -ENOMEM;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Copy task tsk's user namespace, or clone it if flags specifies
> + * CLONE_NEWUSER. In latter case, changes to the user namespace of
> + * this process won't be seen by parent, and vice versa.
> + */
> +int copy_user_ns(int flags, struct task_struct *tsk)
> +{
> +	struct user_namespace *old_ns = tsk->nsproxy->user_ns;
> +	struct user_namespace *new_ns;
> +	int err = 0;
> +
> +	if (!old_ns)
> +		return 0;
> +
> +	get_user_ns(old_ns);
> +
> +	if (!(flags & CLONE_NEWUSER))
> +		return 0;
> +
> +	if (!capable(CAP_SYS_ADMIN)) {
> +		err = -EPERM;
> +		goto out;
> +	}
> +
> +	new_ns = clone_user_ns(old_ns);
> +	if (!new_ns) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +	tsk->nsproxy->user_ns = new_ns;
> +
> +out:
> +	put_user_ns(old_ns);
> +	return err;
> +}
> +
> +void free_user_ns(struct kref *kref)
> +{
> +	struct user_namespace *ns;
> +
> +	ns = container_of(kref, struct user_namespace, kref);
> +	kfree(ns);
> +}
> +
> +#endif /* CONFIG_USER_NS */
> +
>  /*
>   * Locate the user_struct for the passed UID.  If found, take a ref on it.  The
>   * caller must undo that ref with free_uid().
>   *
>   * If the user_struct could not be found, return NULL.
>   */
> -struct user_struct *find_user(uid_t uid)
> +struct user_struct *find_user(struct user_namespace *ns, uid_t uid)
>  {
>  	struct user_struct *ret;
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&uidhash_lock, flags);
> -	ret = uid_hash_find(uid, uidhashentry(uid));
> +	ret = uid_hash_find(uid, uidhashentry(ns, uid));
>  	spin_unlock_irqrestore(&uidhash_lock, flags);
>  	return ret;
>  }
> @@ -120,9 +235,9 @@ void free_uid(struct user_struct *up)
>  	}
>  }
>  
> -struct user_struct * alloc_uid(uid_t uid)
> +struct user_struct * alloc_uid(struct user_namespace *ns, uid_t uid)
>  {
> -	struct list_head *hashent = uidhashentry(uid);
> +	struct list_head *hashent = uidhashentry(ns, uid);
>  	struct user_struct *up;
>  
>  	spin_lock_irq(&uidhash_lock);
> @@ -200,11 +315,11 @@ static int __init uid_cache_init(void)
>  			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
>  
>  	for(n = 0; n < UIDHASH_SZ; ++n)
> -		INIT_LIST_HEAD(uidhash_table + n);
> +		INIT_LIST_HEAD(init_user_ns.uidhash_table + n);
>  
>  	/* Insert the root user immediately (init already runs as root) */
>  	spin_lock_irq(&uidhash_lock);
> -	uid_hash_insert(&root_user, uidhashentry(0));
> +	uid_hash_insert(&root_user, uidhashentry(&init_user_ns, 0));
>  	spin_unlock_irq(&uidhash_lock);
>  
>  	return 0;
> Index: 2.6.18-rc1-mm1/include/linux/nsproxy.h
> ===================================================================
> --- 2.6.18-rc1-mm1.orig/include/linux/nsproxy.h
> +++ 2.6.18-rc1-mm1/include/linux/nsproxy.h
> @@ -7,6 +7,7 @@
>  struct namespace;
>  struct uts_namespace;
>  struct ipc_namespace;
> +struct user_namespace;
>  
>  /*
>   * A structure to contain pointers to all per-process
> @@ -25,6 +26,7 @@ struct nsproxy {
>  	spinlock_t nslock;
>  	struct uts_namespace *uts_ns;
>  	struct ipc_namespace *ipc_ns;
> +	struct user_namespace *user_ns;
>  	struct namespace *namespace;
>  };
>  extern struct nsproxy init_nsproxy;
> Index: 2.6.18-rc1-mm1/include/linux/user.h
> ===================================================================
> --- 2.6.18-rc1-mm1.orig/include/linux/user.h
> +++ 2.6.18-rc1-mm1/include/linux/user.h
> @@ -1 +1,46 @@
> +#ifndef _LINUX_USER_H
> +#define _LINUX_USER_H
> +
>  #include <asm/user.h>
> +
> +#define UIDHASH_BITS (CONFIG_BASE_SMALL ? 3 : 8)
> +#define UIDHASH_SZ		(1 << UIDHASH_BITS)
> +
> +struct user_namespace {
> +	struct kref		kref;
> +	struct list_head	uidhash_table[UIDHASH_SZ];
> +};
> +
> +extern struct user_namespace init_user_ns;
> +
> +static inline void get_user_ns(struct user_namespace *ns)
> +{
> +	kref_get(&ns->kref);
> +}
> +
> +#ifdef CONFIG_USER_NS
> +extern int unshare_user_ns(unsigned long unshare_flags,
> +			   struct user_namespace **new_user);
> +extern int copy_user_ns(int flags, struct task_struct *tsk);
> +extern void free_user_ns(struct kref *kref);
> +
> +static inline void put_user_ns(struct user_namespace *ns)
> +{
> +	kref_put(&ns->kref, free_user_ns);
> +}
> +#else
> +static inline int unshare_user_ns(unsigned long unshare_flags,
> +			struct user_namespace **new_user)
> +{
> +	return -EINVAL;
> +}
> +static inline int copy_user_ns(int flags, struct task_struct *tsk)
> +{
> +	return 0;
> +}
> +static inline void put_user_ns(struct user_namespace *ns)
> +{
> +}
> +#endif /* CONFIG_USER_NS */
> +
> +#endif /* _LINUX_USER_H */
> Index: 2.6.18-rc1-mm1/kernel/nsproxy.c
> ===================================================================
> --- 2.6.18-rc1-mm1.orig/kernel/nsproxy.c
> +++ 2.6.18-rc1-mm1/kernel/nsproxy.c
> @@ -18,6 +18,7 @@
>  #include <linux/nsproxy.h>
>  #include <linux/namespace.h>
>  #include <linux/utsname.h>
> +#include <linux/user.h>
>  
>  static inline void get_nsproxy(struct nsproxy *ns)
>  {
> @@ -65,6 +66,8 @@ struct nsproxy *dup_namespaces(struct ns
>  			get_uts_ns(ns->uts_ns);
>  		if (ns->ipc_ns)
>  			get_ipc_ns(ns->ipc_ns);
> +		if (ns->user_ns)
> +			get_user_ns(ns->user_ns);
>  	}
>  
>  	return ns;
> @@ -85,7 +88,8 @@ int copy_namespaces(int flags, struct ta
>  
>  	get_nsproxy(old_ns);
>  
> -	if (!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC)))
> +	if (!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> +			CLONE_NEWUSER)))
>  		return 0;
>  
>  	new_ns = clone_namespaces(old_ns);
> @@ -108,10 +112,17 @@ int copy_namespaces(int flags, struct ta
>  	if (err)
>  		goto out_ipc;
>  
> +	err = copy_user_ns(flags, tsk);
> +	if (err)
> +		goto out_user;
> +
>  out:
>  	put_nsproxy(old_ns);
>  	return err;
>  
> +out_user:
> +	if (new_ns->ipc_ns)
> +		put_ipc_ns(new_ns->ipc_ns);
>  out_ipc:
>  	if (new_ns->uts_ns)
>  		put_uts_ns(new_ns->uts_ns);
> @@ -132,5 +143,7 @@ void free_nsproxy(struct nsproxy *ns)
>  			put_uts_ns(ns->uts_ns);
>  		if (ns->ipc_ns)
>  			put_ipc_ns(ns->ipc_ns);
> +		if (ns->user_ns)
> +			put_user_ns(ns->user_ns);
>  		kfree(ns);
>  }
> Index: 2.6.18-rc1-mm1/include/linux/sched.h
> ===================================================================
> --- 2.6.18-rc1-mm1.orig/include/linux/sched.h
> +++ 2.6.18-rc1-mm1/include/linux/sched.h
> @@ -26,6 +26,7 @@
>  #define CLONE_STOPPED		0x02000000	/* Start in stopped state */
>  #define CLONE_NEWUTS		0x04000000	/* New utsname group? */
>  #define CLONE_NEWIPC		0x08000000	/* New ipcs */
> +#define CLONE_NEWUSER		0x10000000	/* New user */
>  
>  /*
>   * Scheduling policies
> @@ -243,6 +244,7 @@ extern signed long schedule_timeout_unin
>  asmlinkage void schedule(void);
>  
>  struct nsproxy;
> +struct user_namespace;
>  
>  /* Maximum number of active map areas.. This is a random (large) number */
>  #define DEFAULT_MAX_MAP_COUNT	65536
> @@ -537,7 +539,7 @@ struct user_struct {
>  	uid_t uid;
>  };
>  
> -extern struct user_struct *find_user(uid_t);
> +extern struct user_struct *find_user(struct user_namespace *, uid_t);
>  
>  extern struct user_struct root_user;
>  #define INIT_USER (&root_user)
> @@ -1204,7 +1206,7 @@ extern void set_special_pids(pid_t sessi
>  extern void __set_special_pids(pid_t session, pid_t pgrp);
>  
>  /* per-UID process charging. */
> -extern struct user_struct * alloc_uid(uid_t);
> +extern struct user_struct * alloc_uid(struct user_namespace *, uid_t);
>  static inline struct user_struct *get_uid(struct user_struct *u)
>  {
>  	atomic_inc(&u->__count);
> Index: 2.6.18-rc1-mm1/init/Kconfig
> ===================================================================
> --- 2.6.18-rc1-mm1.orig/init/Kconfig
> +++ 2.6.18-rc1-mm1/init/Kconfig
> @@ -237,6 +237,14 @@ config UTS_NS
>  	  vservers, to use uts namespaces to provide different
>  	  uts info for different servers.  If unsure, say N.
>  
> +config USER_NS
> +	bool "User Namespaces"
> +	default n
> +	help
> +	  Support user namespaces.  This allows containers, i.e.
> +	  vservers, to use user namespaces to provide different
> +	  user info for different servers.  If unsure, say N.
> +
>  config AUDIT
>  	bool "Auditing support"
>  	depends on NET
> Index: 2.6.18-rc1-mm1/include/linux/init_task.h
> ===================================================================
> --- 2.6.18-rc1-mm1.orig/include/linux/init_task.h
> +++ 2.6.18-rc1-mm1/include/linux/init_task.h
> @@ -8,6 +8,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/notifier.h>
>  #include <linux/ipc.h>
> +#include <linux/user.h>
>  
>  #define INIT_FDTABLE \
>  {							\
> @@ -78,6 +79,7 @@ extern struct nsproxy init_nsproxy;
>  	.uts_ns		= &init_uts_ns,					\
>  	.namespace	= NULL,						\
>  	INIT_IPC_NS(ipc_ns)						\
> +	.user_ns	= &init_user_ns,				\
>  }
>  
>  #define INIT_SIGHAND(sighand) {						\
> Index: 2.6.18-rc1-mm1/fs/ioprio.c
> ===================================================================
> --- 2.6.18-rc1-mm1.orig/fs/ioprio.c
> +++ 2.6.18-rc1-mm1/fs/ioprio.c
> @@ -25,6 +25,7 @@
>  #include <linux/capability.h>
>  #include <linux/syscalls.h>
>  #include <linux/security.h>
> +#include <linux/nsproxy.h>
>  
>  static int set_task_ioprio(struct task_struct *task, int ioprio)
>  {
> @@ -101,7 +102,7 @@ asmlinkage long sys_ioprio_set(int which
>  			if (!who)
>  				user = current->user;
>  			else
> -				user = find_user(who);
> +				user = find_user(current->nsproxy->user_ns, who);
>  
>  			if (!user)
>  				break;
> @@ -171,7 +172,7 @@ asmlinkage long sys_ioprio_get(int which
>  			if (!who)
>  				user = current->user;
>  			else
> -				user = find_user(who);
> +				user = find_user(current->nsproxy->user_ns, who);
>  
>  			if (!user)
>  				break;
> Index: 2.6.18-rc1-mm1/kernel/sys.c
> ===================================================================
> --- 2.6.18-rc1-mm1.orig/kernel/sys.c
> +++ 2.6.18-rc1-mm1/kernel/sys.c
> @@ -544,7 +544,8 @@ asmlinkage long sys_setpriority(int whic
>  			if (!who)
>  				who = current->uid;
>  			else
> -				if ((who != current->uid) && !(user = find_user(who)))
> +				if ((who != current->uid) &&
> +					!(user = find_user(current->nsproxy->user_ns, who)))
>  					goto out_unlock;	/* No processes for this user */
>  
>  			do_each_thread(g, p)
> @@ -602,7 +603,8 @@ asmlinkage long sys_getpriority(int whic
>  			if (!who)
>  				who = current->uid;
>  			else
> -				if ((who != current->uid) && !(user = find_user(who)))
> +				if ((who != current->uid) &&
> +					!(user = find_user(current->nsproxy->user_ns, who)))
>  					goto out_unlock;	/* No processes for this user */
>  
>  			do_each_thread(g, p)
> @@ -935,7 +937,7 @@ static int set_user(uid_t new_ruid, int 
>  {
>  	struct user_struct *new_user;
>  
> -	new_user = alloc_uid(new_ruid);
> +	new_user = alloc_uid(current->nsproxy->user_ns, new_ruid);
>  	if (!new_user)
>  		return -EAGAIN;
>  
> 
> --
> 


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-11 16:39   ` Kirill Korotaev
@ 2006-07-11 17:38     ` Cedric Le Goater
  2006-07-12 11:21       ` Kirill Korotaev
  2006-07-12  3:33     ` Eric W. Biederman
  1 sibling, 1 reply; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11 17:38 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Eric W. Biederman, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Kirill Korotaev wrote:
> I wonder about another namespace coupling I found thinking
> about your patches.
> 
> Lets take a look at sys_setpriority() or any other function calling
> find_user():
> it can change the priority for all user or group processes like:
> 
> do_each_thread_ve(g, p) {
>    if (p->uid == who)
>        error = set_one_prio(p, niceval, error);
> } while_each_thread_ve(g, p);

eh. this is openvz code ! thanks :)

> which essentially means that user-namespace becomes coupled with
> process-namespace. Sure, we can check in every such place for
>  p->nsproxy->user_ns == current->nsproxy->user_ns
> condition. But this a way IMHO leading to kernel full of security
> crap which is hardly maintainable.

only 4 syscalls use find_user() : sys_setpriority, sys_getpriority,
sys_ioprio_set, sys_ioprio_get and they use it very simply to check if a
user_struct exists for a given uid. So, it should be OK. But please see the
attached patch.

> Another example of not so evident coupling here:
> user structure maintains number of processes/opened
> files/sigpending/locked_shm etc.
> if a single user can belong to different proccess/ipc/... namespaces
> all these becomes unusable.

this is the purpose of execns.

user namespace can't be unshared through the unshare syscall(). they can
only be unshared through execns() which flushes the previous image of the
process. However, the execns patch still needs to close files without the
close-on-exec flag. I didn't do it yet. lazy me :)

> Small patch comment:
> - what is the reason in adding 2nd arg to find_user()?

The main reason is alloc_uid() in clone_user_ns() and find_user() inherited
the same prototype maybe abusively. Here's a patch.

thanks,

C.

From: Cedric Le Goater <clg@fr.ibm.com>
Subject: remove user namespace arg from find_user

This patch removes the user namespace argument from find_user().

find_user() is always called from the current context, hence the user
namespace can be determined with current->nsproxy->user_ns in
find_user().

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>

---
 fs/ioprio.c           |    4 ++--
 include/linux/sched.h |    2 +-
 kernel/sys.c          |    4 ++--
 kernel/user.c         |    3 ++-
 4 files changed, 7 insertions(+), 6 deletions(-)

Index: 2.6.18-rc1-mm1/fs/ioprio.c
===================================================================
--- 2.6.18-rc1-mm1.orig/fs/ioprio.c
+++ 2.6.18-rc1-mm1/fs/ioprio.c
@@ -102,7 +102,7 @@ asmlinkage long sys_ioprio_set(int which
 			if (!who)
 				user = current->user;
 			else
-				user = find_user(current->nsproxy->user_ns, who);
+				user = find_user(who);

 			if (!user)
 				break;
@@ -172,7 +172,7 @@ asmlinkage long sys_ioprio_get(int which
 			if (!who)
 				user = current->user;
 			else
-				user = find_user(current->nsproxy->user_ns, who);
+				user = find_user(who);

 			if (!user)
 				break;
Index: 2.6.18-rc1-mm1/include/linux/sched.h
===================================================================
--- 2.6.18-rc1-mm1.orig/include/linux/sched.h
+++ 2.6.18-rc1-mm1/include/linux/sched.h
@@ -539,7 +539,7 @@ struct user_struct {
 	uid_t uid;
 };

-extern struct user_struct *find_user(struct user_namespace *, uid_t);
+extern struct user_struct *find_user(uid_t);

 extern struct user_struct root_user;
 #define INIT_USER (&root_user)
Index: 2.6.18-rc1-mm1/kernel/sys.c
===================================================================
--- 2.6.18-rc1-mm1.orig/kernel/sys.c
+++ 2.6.18-rc1-mm1/kernel/sys.c
@@ -545,7 +545,7 @@ asmlinkage long sys_setpriority(int whic
 				who = current->uid;
 			else
 				if ((who != current->uid) &&
-					!(user = find_user(current->nsproxy->user_ns, who)))
+					!(user = find_user(who)))
 					goto out_unlock;	/* No processes for this user */

 			do_each_thread(g, p)
@@ -604,7 +604,7 @@ asmlinkage long sys_getpriority(int whic
 				who = current->uid;
 			else
 				if ((who != current->uid) &&
-					!(user = find_user(current->nsproxy->user_ns, who)))
+					!(user = find_user(who)))
 					goto out_unlock;	/* No processes for this user */

 			do_each_thread(g, p)
Index: 2.6.18-rc1-mm1/kernel/user.c
===================================================================
--- 2.6.18-rc1-mm1.orig/kernel/user.c
+++ 2.6.18-rc1-mm1/kernel/user.c
@@ -205,10 +205,11 @@ void free_user_ns(struct kref *kref)
  *
  * If the user_struct could not be found, return NULL.
  */
-struct user_struct *find_user(struct user_namespace *ns, uid_t uid)
+struct user_struct *find_user(uid_t uid)
 {
 	struct user_struct *ret;
 	unsigned long flags;
+	struct user_namespace *ns = current->nsproxy->user_ns;

 	spin_lock_irqsave(&uidhash_lock, flags);
 	ret = uid_hash_find(uid, uidhashentry(ns, uid));


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-11 17:38     ` Cedric Le Goater
@ 2006-07-12 11:21       ` Kirill Korotaev
  2006-07-13 16:01         ` Cedric Le Goater
  0 siblings, 1 reply; 107+ messages in thread
From: Kirill Korotaev @ 2006-07-12 11:21 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Eric W. Biederman, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

>>Lets take a look at sys_setpriority() or any other function calling
>>find_user():
>>it can change the priority for all user or group processes like:
>>
>>do_each_thread_ve(g, p) {
>>   if (p->uid == who)
>>       error = set_one_prio(p, niceval, error);
>>} while_each_thread_ve(g, p);
> 
> 
> eh. this is openvz code ! thanks :)
it doesn't matter :)
2.6.17 code is:
                        do_each_thread(g, p)
                                if (p->uid == who)
                                        error = set_one_prio(p, niceval, error);
                        while_each_thread(g, p);

when introducing process namespaces we will have to isolate processes somehow and this loop, agree?
in this case 1 user-namespace can belong to 2 process-namespaces, agree?
how do you see this loop in the future making sure that above situation is handled correctly?
how many other such places do we have?

>>which essentially means that user-namespace becomes coupled with
>>process-namespace. Sure, we can check in every such place for
>> p->nsproxy->user_ns == current->nsproxy->user_ns
>>condition. But this a way IMHO leading to kernel full of security
>>crap which is hardly maintainable.
> 
> 
> only 4 syscalls use find_user() : sys_setpriority, sys_getpriority,
> sys_ioprio_set, sys_ioprio_get and they use it very simply to check if a
> user_struct exists for a given uid. So, it should be OK. But please see the
> attached patch.
the problem is not in find_user() actually. but in uid comparison inside
some kind of process iteration loop.
In this case you select processes p which belong to both namespaces simultenously:
i.e. processes p which belong both to user-namespace U and process-namespace P.

I hope I was more clear this time :)

>>Another example of not so evident coupling here:
>>user structure maintains number of processes/opened
>>files/sigpending/locked_shm etc.
>>if a single user can belong to different proccess/ipc/... namespaces
>>all these becomes unusable.
> 
> 
> this is the purpose of execns.
> 
> user namespace can't be unshared through the unshare syscall().
why? we do it fine in OpenVZ.

> they can
> only be unshared through execns() which flushes the previous image of the
> process. However, the execns patch still needs to close files without the
> close-on-exec flag. I didn't do it yet. lazy me :)

Kirill


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-12 11:21       ` Kirill Korotaev
@ 2006-07-13 16:01         ` Cedric Le Goater
  0 siblings, 0 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-13 16:01 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Eric W. Biederman, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Kirill Korotaev wrote:
>>> Lets take a look at sys_setpriority() or any other function calling
>>> find_user():
>>> it can change the priority for all user or group processes like:
>>>
>>> do_each_thread_ve(g, p) {
>>>   if (p->uid == who)
>>>       error = set_one_prio(p, niceval, error);
>>> } while_each_thread_ve(g, p);
>>
>>
>> eh. this is openvz code ! thanks :)
> it doesn't matter :)

it does. it means for me that you are studying proposals to see how if fits
with your existing code. which is good.

> 2.6.17 code is:
>                        do_each_thread(g, p)
>                                if (p->uid == who)
>                                        error = set_one_prio(p, niceval,
> error);
>                        while_each_thread(g, p);
> 
> when introducing process namespaces we will have to isolate processes
> somehow and this loop, agree?

yes

> in this case 1 user-namespace can belong to 2 process-namespaces, agree?
> how do you see this loop in the future making sure that above situation
> is handled correctly?

IMO, the loop should apply to the current->pidspace or equivalent inside
the loop

> how many other such places do we have?

if it's embedded in the loop, it should not be too much of an issue ?

>>> which essentially means that user-namespace becomes coupled with
>>> process-namespace. Sure, we can check in every such place for
>>> p->nsproxy->user_ns == current->nsproxy->user_ns
>>> condition. But this a way IMHO leading to kernel full of security
>>> crap which is hardly maintainable.
>>
>> only 4 syscalls use find_user() : sys_setpriority, sys_getpriority,
>> sys_ioprio_set, sys_ioprio_get and they use it very simply to check if a
>> user_struct exists for a given uid. So, it should be OK. But please
>> see the attached patch.
>
> the problem is not in find_user() actually. but in uid comparison inside
> some kind of process iteration loop. In this case you select processes
> p which belong to both namespaces simultenously: i.e. processes p which
> belong both to user-namespace U and process-namespace P.
> 
> I hope I was more clear this time :)

yes thanks,

for the moment, if processes are not isolated in some others ways, like in
openvz, these kind of loops would need the extra test 'p->nsproxy->user_ns
== current->nsproxy->user_ns' on user namespace to be valid. same issue for
filesystem and many other places. eric raised that point.

In theory, if I understand well eric's concept of namespaces, a task
belongs to a union of namespaces : ipc, process, user, net, utsname, fs,
etc. some of these namespaces could be default namespaces and some not
because they were unshared in some way: clone, unshare, exec, but in a safe
way.

They are necessary bricks for a bigger abstraction, let's call it
container, but they not sufficient by them selves because they have
dependencies. The container comes as a whole and not subsystem by
subsystem, I agree with you on that point.

>>> Another example of not so evident coupling here:
>>> user structure maintains number of processes/opened
>>> files/sigpending/locked_shm etc.
>>> if a single user can belong to different proccess/ipc/... namespaces
>>> all these becomes unusable.
>>
>>
>> this is the purpose of execns.
>>
>> user namespace can't be unshared through the unshare syscall().
>
> why? we do it fine in OpenVZ.

probably because you use the full container approach in openvz and start
the container by running init ? namespaces are a bit more painful ... I agree.

I'm still struggling with the limits of that namespace concept. Hopefully,
we meet next week because I'm also reaching my limits of digital
interaction on this topic :)

thanks,

C.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-11 16:39   ` Kirill Korotaev
  2006-07-11 17:38     ` Cedric Le Goater
@ 2006-07-12  3:33     ` Eric W. Biederman
  2006-07-12 11:13       ` Kirill Korotaev
  1 sibling, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-12  3:33 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Kirill Korotaev <dev@sw.ru> writes:

> Another example of not so evident coupling here:
> user structure maintains number of processes/opened files/sigpending/locked_shm
> etc.
> if a single user can belong to different proccess/ipc/... namespaces
> all these becomes unusable.

Why do the count of the number of objects a user has become
unusable if they can count objects in multiple namespaces?

Namespaces are about how names are looked up and how names are
created.  Namespaces are not about the objects those names refer to.

Eric


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-12  3:33     ` Eric W. Biederman
@ 2006-07-12 11:13       ` Kirill Korotaev
  2006-07-12 18:10         ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Kirill Korotaev @ 2006-07-12 11:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

>>Another example of not so evident coupling here:
>>user structure maintains number of processes/opened files/sigpending/locked_shm
>>etc.
>>if a single user can belong to different proccess/ipc/... namespaces
>>all these becomes unusable.
> 
> 
> Why do the count of the number of objects a user has become
> unusable if they can count objects in multiple namespaces?
> 
> Namespaces are about how names are looked up and how names are
> created.  Namespaces are not about the objects those names refer to.

One example below, which I believe is a bug due to coupling.
Will be glad to hear your opinion on this.

Let user u to unshare its process namespace and run e.g. httpd inside newly created 2nd process namespace.
Now imagine that user u hits his process rlimit.
He can't kill his httpd's because they are in another process namespace. He can kill visible to his bash processes from
1st process namespace, but httpd can spawn childs more after that. So we end up with the situation
when user u can't control his processes, nor run any other processes in his bash.

I'm fine with such situations, since we need containers mostly, but what makes me
really afraid is that it introduces hard to find/fix/maintain issues. I have no any other concerns.

Thanks,
Kirill

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-12 11:13       ` Kirill Korotaev
@ 2006-07-12 18:10         ` Eric W. Biederman
  2006-07-13 17:00           ` Cedric Le Goater
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-12 18:10 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Kirill Korotaev <dev@sw.ru> writes:

>>>Another example of not so evident coupling here:
>>>user structure maintains number of processes/opened
> files/sigpending/locked_shm
>>>etc.
>>>if a single user can belong to different proccess/ipc/... namespaces
>>>all these becomes unusable.
>> Why do the count of the number of objects a user has become
>> unusable if they can count objects in multiple namespaces?
>> Namespaces are about how names are looked up and how names are
>> created.  Namespaces are not about the objects those names refer to.
>
> One example below, which I believe is a bug due to coupling.
> Will be glad to hear your opinion on this.
>
> Let user u to unshare its process namespace and run e.g. httpd inside newly
> created 2nd process namespace.
> Now imagine that user u hits his process rlimit.
> He can't kill his httpd's because they are in another process namespace. He can
> kill visible to his bash processes from
> 1st process namespace, but httpd can spawn childs more after that. So we end up
> with the situation
> when user u can't control his processes, nor run any other processes in his
> bash.

Yes, this can happen.   But as described this really is a usage problem.  I would
expect if your uid is in use in multiple places you will have accesses to all of
those places.  Part of this won't be clear until we sort out the process id
namespace though.

> I'm fine with such situations, since we need containers mostly, but what makes
> me
> really afraid is that it introduces hard to find/fix/maintain issues. I have no
> any other concerns.

Hard to find and maintain problems I agree should be avoided.  There are only two
ways I can see coping with the weird interactions that might occur.
1) Assert weird interactions will never happen, don't worry about it,
   and stomp on any place where they can occur.  (A fully isolated container approach).

2) Assume weird interactions happen and write the code so that it simply
   works if those interactions happen, because for each namespace you have
   made certain the code works regardless of which namespace the objects are
   in.

The second case is slightly harder.  But as far as I can tell it is more robust
and allows for much better incremental development.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-12 18:10         ` Eric W. Biederman
@ 2006-07-13 17:00           ` Cedric Le Goater
  2006-07-13 18:07             ` Eric W. Biederman
  2006-07-13 18:21             ` Eric W. Biederman
  0 siblings, 2 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-13 17:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kirill Korotaev, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Eric W. Biederman wrote:

>> I'm fine with such situations, since we need containers mostly, but what makes
>> me really afraid is that it introduces hard to find/fix/maintain issues. I have no
>> any other concerns.
> 
> Hard to find and maintain problems I agree should be avoided.  There are only two
> ways I can see coping with the weird interactions that might occur.
>
> 1) Assert weird interactions will never happen, don't worry about it,
>    and stomp on any place where they can occur.  (A fully isolated container approach).
> 
> 2) Assume weird interactions happen and write the code so that it simply
>    works if those interactions happen, because for each namespace you have
>    made certain the code works regardless of which namespace the objects are
>    in.
> 
> The second case is slightly harder.  But as far as I can tell it is more robust
> and allows for much better incremental development.

hmm, slightly, I would say much harder and these weird interactions are
very hard to anticipate without some experience in the field. We could
continue on arguing for ages without making any progress.

let's apply that incremental development approach now. Let's work on simple
namespaces which would make _some_ container scenarios possible and not
all. IMHO, that would mean tying some namespaces together and find a way to
unshare them safely as a whole. Get some experience on it and then work on
unsharing some more independently for the benefit of more use case
scenarios. I like the concept and I think it will be useful.

just being pragmatic, i like things to start working in simple cases before
over optimizing them.

cheers,

C.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 17:00           ` Cedric Le Goater
@ 2006-07-13 18:07             ` Eric W. Biederman
  2006-07-13 18:21             ` Eric W. Biederman
  1 sibling, 0 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-13 18:07 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: Kirill Korotaev, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Cedric Le Goater <clg@fr.ibm.com> writes:

> hmm, slightly, I would say much harder and these weird interactions are
> very hard to anticipate without some experience in the field. We could
> continue on arguing for ages without making any progress.

I have not been intending to argue but to point out short comings.

> let's apply that incremental development approach now. Let's work on simple
> namespaces which would make _some_ container scenarios possible and not
> all. IMHO, that would mean tying some namespaces together and find a way to
> unshare them safely as a whole. Get some experience on it and then work on
> unsharing some more independently for the benefit of more use case
> scenarios. I like the concept and I think it will be useful.

I think we have a different conception of where the problems lie.

The easy cases I see are everything as a unit, or everything as
separate pieces.  I do not see natural connection between namespaces
that will help us out.

> just being pragmatic, i like things to start working in simple cases before
> over optimizing them.

I agree we incremental development is good and what we need.

The hard part is that we can never undo any part of our user interface.
So we must have complete an sane semantics when we implement a namespace,
before it gets merged anywhere beyond a purely development tree.

Then in addition to that usually you find that the existing implementation
does not have a good 1-1 change and must be refactored and have the cruft
removed before you can start extending it.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 17:00           ` Cedric Le Goater
  2006-07-13 18:07             ` Eric W. Biederman
@ 2006-07-13 18:21             ` Eric W. Biederman
  2006-07-13 18:31               ` Dave Hansen
  1 sibling, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-13 18:21 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: Kirill Korotaev, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Ok.  Stepping back a little ways.

We need a formula for doing incremental development that will allow us to
make headway while not seeing the entire picture all at once.  The scope
is just too large.

The usual pattern is for someone to do implement the code in the order
and the fashion that seems obvious and then to reimplement it with a
history that is aids code reviews, and makes it obvious what is
happening.  The reimplementation results in the exact same code but
the steps to get there are different.

We can either do that as individuals or as a group.  But that
is the only pattern I know of that allows us to be comprehensive
when we merge and fall short during development.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 18:21             ` Eric W. Biederman
@ 2006-07-13 18:31               ` Dave Hansen
  2006-07-13 18:54                 ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2006-07-13 18:31 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cedric Le Goater, Kirill Korotaev, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn

On Thu, 2006-07-13 at 12:21 -0600, Eric W. Biederman wrote:
> We need a formula for doing incremental development that will allow us to
> make headway while not seeing the entire picture all at once.  The scope
> is just too large.

Definitely.  We need a low-risk development environment where we can put
test-fit pieces together, but also not worry too much of we have to rip
pieces out, or completely change them.

I'm not sure we *need* to rewrite things for review-ability later.  I
think some of us have gotten pretty good at keeping our development in
reviewable bits as we go along.

-- Dave

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 18:31               ` Dave Hansen
@ 2006-07-13 18:54                 ` Eric W. Biederman
  0 siblings, 0 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-13 18:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Cedric Le Goater, Kirill Korotaev, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn

Dave Hansen <haveblue@us.ibm.com> writes:

> On Thu, 2006-07-13 at 12:21 -0600, Eric W. Biederman wrote:
>> We need a formula for doing incremental development that will allow us to
>> make headway while not seeing the entire picture all at once.  The scope
>> is just too large.
>
> Definitely.  We need a low-risk development environment where we can put
> test-fit pieces together, but also not worry too much of we have to rip
> pieces out, or completely change them.

It probably makes sense to talk about this up in Ottowa.

> I'm not sure we *need* to rewrite things for review-ability later.  I
> think some of us have gotten pretty good at keeping our development in
> reviewable bits as we go along.

Well the rewrite of the history may simply be an ordering change.

It isn't so much the reviewability of a single piece but the reviewability
of the whole that I am worried about.  I have had one occasion lately
where I made a small simple sane change but I discovered that there
was one little hack I had to get rid of to make everything work.

To remove that hack required me to refactor another piece of code and
was more work than the original change.  That refactoring then had to happen
first, when the code was merged with the rest of the kernel.

So while we should try to keep our pieces as sane as possible we should
assume we have to rewrite history from our sandbox when we push the code
upstream.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-11  7:50 ` [PATCH -mm 5/7] add user namespace Cedric Le Goater
  2006-07-11 16:39   ` Kirill Korotaev
@ 2006-07-12  3:46   ` Eric W. Biederman
  2006-07-12 12:05     ` Herbert Poetzl
  2006-07-12 14:00     ` Cedric Le Goater
  1 sibling, 2 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-12  3:46 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain, Serge E. Hallyn, Dave Hansen

Cedric Le Goater <clg@fr.ibm.com> writes:

> This patch adds the user namespace.
>
> Basically, it allows a process to unshare its user_struct table,
> resetting at the same time its own user_struct and all the associated
> accounting.
>
> For the moment, the root_user is added to the new user namespace when
> it is cloned. An alternative behavior would be to let the system
> allocate a new user_struct(0) in each new user namespace. However,
> these 0 users would not have the privileges of the root_user and it
> would be necessary to work on the process capabilities to give them
> some permissions.

It is completely the wrong thing for a the root_user to span multiple
namespaces as you describe.  It is important for uid 0 in other namespaces
to not have the privileges of the root_user.  That is half the point.

Too many files in sysfs and proc don't require caps but instead simply
limit things to uid 0.  Having a separate uid 0 in the different namespaces
instantly makes all of these files inaccessible, and keeps processes from
doing something bad.

To a filesystem a uid does not share a uid namespace with the only things
that should be accessible are those things that are readable/writeable
by everyone.   Unless the filesystem has provisions for storing multiple
uid namespaces not files should be able to be created.  Think NFS root
squash.

> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
> Cc: Andrew Morton <akpm@osdl.org>
> Cc: Kirill Korotaev <dev@openvz.org>
> Cc: Andrey Savochkin <saw@sw.ru>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: Herbert Poetzl <herbert@13thfloor.at>
> Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
> Cc: Serge E. Hallyn <serue@us.ibm.com>
> Cc: Dave Hansen <haveblue@us.ibm.com>
>
> ---
>  fs/ioprio.c               |    5 +
>  include/linux/init_task.h |    2 
>  include/linux/nsproxy.h   |    2 
>  include/linux/sched.h     |    6 +-
>  include/linux/user.h      |   45 +++++++++++++++
>  init/Kconfig              |    8 ++
>  kernel/nsproxy.c          |   15 ++++-
>  kernel/sys.c              |    8 +-
>  kernel/user.c | 135 ++++++++++++++++++++++++++++++++++++++++++----

This patch looks extremly incomplete.

Every comparison of a user id needs to compare the tuple
(user namespace, user id) or it needs to compare struct users.

Ever comparison of a group id needs to compare the tuple
(user namespace, group id) or it needs to compare struct users.

I think the key infrastructure needs to be looked at here as well.

There needs to be a user namespace association for mounted filesystems.

We need a discussion about how we handle map users from one user
namespace to another, because without some form of mapping so many
things become inaccessible that the system is almost useless.

I believe some of the key infrastructure which is roughly kerberos
authentication tokens could be used for this purpose.

A user namespace is a big thing.  What I see here doesn't even
seem to scratch the surface.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-12  3:46   ` Eric W. Biederman
@ 2006-07-12 12:05     ` Herbert Poetzl
  2006-07-12 17:09       ` Eric W. Biederman
  2006-07-12 14:00     ` Cedric Le Goater
  1 sibling, 1 reply; 107+ messages in thread
From: Herbert Poetzl @ 2006-07-12 12:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Sam Vilain, Serge E. Hallyn, Dave Hansen

On Tue, Jul 11, 2006 at 09:46:01PM -0600, Eric W. Biederman wrote:
> Cedric Le Goater <clg@fr.ibm.com> writes:
> 
> > This patch adds the user namespace.
> >
> > Basically, it allows a process to unshare its user_struct table,
> > resetting at the same time its own user_struct and all the
> > associated accounting.
> >
> > For the moment, the root_user is added to the new user namespace
> > when it is cloned. An alternative behavior would be to let the
> > system allocate a new user_struct(0) in each new user namespace.
> > However, these 0 users would not have the privileges of the
> > root_user and it would be necessary to work on the process
> > capabilities to give them some permissions.
> 
> It is completely the wrong thing for a the root_user to span multiple
> namespaces as you describe. It is important for uid 0 in other
> namespaces to not have the privileges of the root_user. That is half
> the point.
>
> Too many files in sysfs and proc don't require caps but instead simply
> limit things to uid 0. Having a separate uid 0 in the different
> namespaces instantly makes all of these files inaccessible, and keeps
> processes from doing something bad.

well, here I'd definitely prefer to fix up that 'broken'
entries by adding proper capability checks, and maybe
even a bunch of new capabilities (i.e. 64bit caps and
such) first, because IMHO the capability system is the
'proper' method to protect them in the first place

> To a filesystem a uid does not share a uid namespace with the
> only things that should be accessible are those things that are
> readable/writeable by everyone. Unless the filesystem has provisions
> for storing multiple uid namespaces not files should be able to be
> created. Think NFS root squash.

that's where file tagging as Linux-VServer does it can
be used to 'share' a partition between different guests
(and have separate disk limits and quotas)

best,
Herbert

> > Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
> > Cc: Andrew Morton <akpm@osdl.org>
> > Cc: Kirill Korotaev <dev@openvz.org>
> > Cc: Andrey Savochkin <saw@sw.ru>
> > Cc: Eric W. Biederman <ebiederm@xmission.com>
> > Cc: Herbert Poetzl <herbert@13thfloor.at>
> > Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
> > Cc: Serge E. Hallyn <serue@us.ibm.com>
> > Cc: Dave Hansen <haveblue@us.ibm.com>
> >
> > ---
> >  fs/ioprio.c               |    5 +
> >  include/linux/init_task.h |    2 
> >  include/linux/nsproxy.h   |    2 
> >  include/linux/sched.h     |    6 +-
> >  include/linux/user.h      |   45 +++++++++++++++
> >  init/Kconfig              |    8 ++
> >  kernel/nsproxy.c          |   15 ++++-
> >  kernel/sys.c              |    8 +-
> >  kernel/user.c | 135 ++++++++++++++++++++++++++++++++++++++++++----
> 
> This patch looks extremly incomplete.
> 
> Every comparison of a user id needs to compare the tuple
> (user namespace, user id) or it needs to compare struct users.
> 
> Ever comparison of a group id needs to compare the tuple
> (user namespace, group id) or it needs to compare struct users.
> 
> I think the key infrastructure needs to be looked at here as well.
> 
> There needs to be a user namespace association for mounted filesystems.
> 
> We need a discussion about how we handle map users from one user
> namespace to another, because without some form of mapping so many
> things become inaccessible that the system is almost useless.
> 
> I believe some of the key infrastructure which is roughly kerberos
> authentication tokens could be used for this purpose.
> 
> A user namespace is a big thing.  What I see here doesn't even
> seem to scratch the surface.
> 
> Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-12 12:05     ` Herbert Poetzl
@ 2006-07-12 17:09       ` Eric W. Biederman
  0 siblings, 0 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-12 17:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Sam Vilain, Serge E. Hallyn, Dave Hansen

Herbert Poetzl <herbert@13thfloor.at> writes:

> On Tue, Jul 11, 2006 at 09:46:01PM -0600, Eric W. Biederman wrote:
>> Cedric Le Goater <clg@fr.ibm.com> writes:
>> 
>> > This patch adds the user namespace.
>> >
>> > Basically, it allows a process to unshare its user_struct table,
>> > resetting at the same time its own user_struct and all the
>> > associated accounting.
>> >
>> > For the moment, the root_user is added to the new user namespace
>> > when it is cloned. An alternative behavior would be to let the
>> > system allocate a new user_struct(0) in each new user namespace.
>> > However, these 0 users would not have the privileges of the
>> > root_user and it would be necessary to work on the process
>> > capabilities to give them some permissions.
>> 
>> It is completely the wrong thing for a the root_user to span multiple
>> namespaces as you describe. It is important for uid 0 in other
>> namespaces to not have the privileges of the root_user. That is half
>> the point.
>>
>> Too many files in sysfs and proc don't require caps but instead simply
>> limit things to uid 0. Having a separate uid 0 in the different
>> namespaces instantly makes all of these files inaccessible, and keeps
>> processes from doing something bad.
>
> well, here I'd definitely prefer to fix up that 'broken'
> entries by adding proper capability checks, and maybe
> even a bunch of new capabilities (i.e. 64bit caps and
> such) first, because IMHO the capability system is the
> 'proper' method to protect them in the first place

I guess the good way to think of some of this is not as broken files.
But rather in the plan9 model.  Initially everything on the machine is
owned by one user (in our case root).  That user then controls who
else gets access to files by calling chown etc.

If you want access to those files in a different uid namespace someone
needs to call chown or at least chmod on them, and I'm not
at all certain this as a problem is limited to special files on
a filesystem.

Really when you do a proper user namespace this all comes for free.

>> To a filesystem a uid does not share a uid namespace with the
>> only things that should be accessible are those things that are
>> readable/writeable by everyone. Unless the filesystem has provisions
>> for storing multiple uid namespaces not files should be able to be
>> created. Think NFS root squash.
>
> that's where file tagging as Linux-VServer does it can
> be used to 'share' a partition between different guests
> (and have separate disk limits and quotas)

I completely agree we need a capability like that.  I disagree on the
implementation because it is not general.  What we have really hit is
the classic uid mapping problem that shows up occasionally with
network filesystems.  That is the problem we need to solve.

I'm pretty certain the current kernel key infrastructure gives us the
infrastructure we need to do that in a general way.  At which point
it would be a policy decision which uid to map to which other uid
and we could easily match the current 

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-12  3:46   ` Eric W. Biederman
  2006-07-12 12:05     ` Herbert Poetzl
@ 2006-07-12 14:00     ` Cedric Le Goater
  2006-07-12 17:24       ` Eric W. Biederman
  1 sibling, 1 reply; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-12 14:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain, Serge E. Hallyn, Dave Hansen

Eric W. Biederman wrote:
> Cedric Le Goater <clg@fr.ibm.com> writes:
> 
>> This patch adds the user namespace.
>>
>> Basically, it allows a process to unshare its user_struct table,
>> resetting at the same time its own user_struct and all the associated
>> accounting.
>>
>> For the moment, the root_user is added to the new user namespace when
>> it is cloned. An alternative behavior would be to let the system
>> allocate a new user_struct(0) in each new user namespace. However,
>> these 0 users would not have the privileges of the root_user and it
>> would be necessary to work on the process capabilities to give them
>> some permissions.
> 
> It is completely the wrong thing for a the root_user to span multiple
> namespaces as you describe.  It is important for uid 0 in other namespaces
> to not have the privileges of the root_user.  That is half the point.

ok good. that's what i thought also.

> Too many files in sysfs and proc don't require caps but instead simply
> limit things to uid 0.  Having a separate uid 0 in the different namespaces
> instantly makes all of these files inaccessible, and keeps processes from
> doing something bad.

but in order to be useful, the uid 0 in other namespaces will need to have
some capabilities.

> To a filesystem a uid does not share a uid namespace with the only things
> that should be accessible are those things that are readable/writeable
> by everyone.   Unless the filesystem has provisions for storing multiple
> uid namespaces not files should be able to be created.  Think NFS root
> squash.

I think user namespace should be unshared with filesystem. if not, the
user/admin should know what is doing.

> Every comparison of a user id needs to compare the tuple
> (user namespace, user id) or it needs to compare struct users.
>
> Ever comparison of a group id needs to compare the tuple
> (user namespace, group id) or it needs to compare struct users.

yes, that would be the ultimate user namespace.

I think that this first patchset lays some infrastructure that is already
quite usable in a container which already isolates file, pids, etc and not
only users.

now, we could work on extending it to support fine grain user namespace
which i think can done on top of this first patch.

> I think the key infrastructure needs to be looked at here as well.
>
> There needs to be a user namespace association for mounted filesystems.

yes you could expect that to check the i_uid against fsuid but should we
enforce it completely ?

we already have an issue today with a simple NFS mount on 2 hosts with
different user mapping. namespace can't fix all issues.

> We need a discussion about how we handle map users from one user
> namespace to another, because without some form of mapping so many
> things become inaccessible that the system is almost useless.
> 
> I believe some of the key infrastructure which is roughly kerberos
> authentication tokens could be used for this purpose.

please elaborate ? i'm not sure to understand why you want to use the keys
to map users.

> A user namespace is a big thing.  What I see here doesn't even
> seem to scratch the surface.

good let's start digging !

thanks,

C.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-12 14:00     ` Cedric Le Goater
@ 2006-07-12 17:24       ` Eric W. Biederman
  2006-07-13 17:36         ` Cedric Le Goater
  2006-07-14 14:17         ` Serge E. Hallyn
  0 siblings, 2 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-12 17:24 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain, Serge E. Hallyn, Dave Hansen

Cedric Le Goater <clg@fr.ibm.com> writes:

> Eric W. Biederman wrote:
>> Cedric Le Goater <clg@fr.ibm.com> writes:
>> 
>>> This patch adds the user namespace.
>>>
>>> Basically, it allows a process to unshare its user_struct table,
>>> resetting at the same time its own user_struct and all the associated
>>> accounting.
>>>
>>> For the moment, the root_user is added to the new user namespace when
>>> it is cloned. An alternative behavior would be to let the system
>>> allocate a new user_struct(0) in each new user namespace. However,
>>> these 0 users would not have the privileges of the root_user and it
>>> would be necessary to work on the process capabilities to give them
>>> some permissions.
>> 
>> It is completely the wrong thing for a the root_user to span multiple
>> namespaces as you describe.  It is important for uid 0 in other namespaces
>> to not have the privileges of the root_user.  That is half the point.
>
> ok good. that's what i thought also.
>
>> Too many files in sysfs and proc don't require caps but instead simply
>> limit things to uid 0.  Having a separate uid 0 in the different namespaces
>> instantly makes all of these files inaccessible, and keeps processes from
>> doing something bad.
>
> but in order to be useful, the uid 0 in other namespaces will need to have
> some capabilities.

Yes.  It is useful for uid 0 in other namespaces to have some capabilities.
But that is just a matter of giving another user some additional
capabilities.  That mechanism may still work but it should not be
namespace specific magic there.  The trick I guess is which
capabilities a setuid binary for the other root user gets.

One thing to be careful about here is that, at least as I 
envision it until setresuid is called your user does not change
even when you unshare your user namespace.

>> To a filesystem a uid does not share a uid namespace with the only things
>> that should be accessible are those things that are readable/writeable
>> by everyone.   Unless the filesystem has provisions for storing multiple
>> uid namespaces not files should be able to be created.  Think NFS root
>> squash.
>
> I think user namespace should be unshared with filesystem. if not, the
> user/admin should know what is doing.

No.  The uids in a filesystem are interpreted in some user namespace
context.  We can discover that context at the first mount of the
filesystem.  Assuming the uids on a filesystem are the same set
of uids your process is using is just wrong.

>> Every comparison of a user id needs to compare the tuple
>> (user namespace, user id) or it needs to compare struct users.
>>
>> Ever comparison of a group id needs to compare the tuple
>> (user namespace, group id) or it needs to compare struct users.
>
> yes, that would be the ultimate user namespace.
>
> I think that this first patchset lays some infrastructure that is already
> quite usable in a container which already isolates file, pids, etc and not
> only users.
>
> now, we could work on extending it to support fine grain user namespace
> which i think can done on top of this first patch.

Yes.  Your patch does lay some interesting foundation work.
But we must not merge it upstream until we have a complete patchset
that handles all of the user namespace issues.

>> I think the key infrastructure needs to be looked at here as well.
>>
>> There needs to be a user namespace association for mounted filesystems.
>
> yes you could expect that to check the i_uid against fsuid but should we
> enforce it completely ?
>
> we already have an issue today with a simple NFS mount on 2 hosts with
> different user mapping. namespace can't fix all issues.

Yes.  This is an existing problem, which we have just escalated the
frequency of immensely if we are doing user namespaces.  The normal
solution is to put everyone on the network in a single user id
administration domain with ldap or NIS.

However that is avoiding the problem, and having multiple user id
domains is the point of a user id namespace.  If we escalate the
problem we should solve it.

>> We need a discussion about how we handle map users from one user
>> namespace to another, because without some form of mapping so many
>> things become inaccessible that the system is almost useless.
>> 
>> I believe some of the key infrastructure which is roughly kerberos
>> authentication tokens could be used for this purpose.
>
> please elaborate ? i'm not sure to understand why you want to use the keys
> to map users.

keys are essentially security credentials for something besides the
local kernel.  Think kerberos tickets.  That makes the keys the
obvious place to say what uid you are in a different user namespace
and similar things.

>> A user namespace is a big thing.  What I see here doesn't even
>> seem to scratch the surface.
>
> good let's start digging !

One piece at a time :)

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-12 17:24       ` Eric W. Biederman
@ 2006-07-13 17:36         ` Cedric Le Goater
  2006-07-13 17:47           ` Serge E. Hallyn
  2006-07-13 17:59           ` Eric W. Biederman
  2006-07-14 14:17         ` Serge E. Hallyn
  1 sibling, 2 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-13 17:36 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain, Serge E. Hallyn, Dave Hansen

Eric W. Biederman wrote:

>> I think user namespace should be unshared with filesystem. if not, the
>> user/admin should know what is doing.
> 
> No.  The uids in a filesystem are interpreted in some user namespace
> context.  We can discover that context at the first mount of the
> filesystem. 

ok. so once you're in such a user namespace, you can't unshare from it
without loosing access to all your files ?

> Assuming the uids on a filesystem are the same set of uids your process
> is using is just wrong.

well, this is what is currently done without user namespaces.

>>> I believe some of the key infrastructure which is roughly kerberos
>>> authentication tokens could be used for this purpose.
>> please elaborate ? i'm not sure to understand why you want to use the keys
>> to map users.
> 
> keys are essentially security credentials for something besides the
> local kernel.  Think kerberos tickets.  That makes the keys the
> obvious place to say what uid you are in a different user namespace
> and similar things.

what about performance ? wouldn't that slow the checking ?

thanks,

C.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 17:36         ` Cedric Le Goater
@ 2006-07-13 17:47           ` Serge E. Hallyn
  2006-07-13 18:14             ` Eric W. Biederman
  2006-07-13 17:59           ` Eric W. Biederman
  1 sibling, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2006-07-13 17:47 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: Eric W. Biederman, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Quoting Cedric Le Goater (clg@fr.ibm.com):
> Eric W. Biederman wrote:
> 
> >> I think user namespace should be unshared with filesystem. if not, the
> >> user/admin should know what is doing.
> > 
> > No.  The uids in a filesystem are interpreted in some user namespace
> > context.  We can discover that context at the first mount of the
> > filesystem. 
> 
> ok. so once you're in such a user namespace, you can't unshare from it
> without loosing access to all your files ?
> 
> > Assuming the uids on a filesystem are the same set of uids your process
> > is using is just wrong.
> 
> well, this is what is currently done without user namespaces.
> 
> >>> I believe some of the key infrastructure which is roughly kerberos
> >>> authentication tokens could be used for this purpose.
> >> please elaborate ? i'm not sure to understand why you want to use the keys
> >> to map users.
> > 
> > keys are essentially security credentials for something besides the
> > local kernel.  Think kerberos tickets.  That makes the keys the
> > obvious place to say what uid you are in a different user namespace
> > and similar things.
> 
> what about performance ? wouldn't that slow the checking ?

Rather than try to store (uid, namespace) on the filesystem, I like the
idea of doing something like

	mount --bind -o ro --uidswap 500,1001 --uidswap 501,0 /home /home

In other words, when you unshare the user namespace, nothing
filesystem-related changes unless you also unshare the fs-namespace and
set uid info on the vfsmount.  This is fully backward-compatible and
should have no overhead if you don't need the feature.

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 17:47           ` Serge E. Hallyn
@ 2006-07-13 18:14             ` Eric W. Biederman
  2006-07-13 18:29               ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-13 18:14 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Dave Hansen

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Rather than try to store (uid, namespace) on the filesystem, I like the
> idea of doing something like
>
> 	mount --bind -o ro --uidswap 500,1001 --uidswap 501,0 /home /home
>
> In other words, when you unshare the user namespace, nothing
> filesystem-related changes unless you also unshare the fs-namespace and
> set uid info on the vfsmount.  This is fully backward-compatible and
> should have no overhead if you don't need the feature.

Maybe.  I really think the sane semantics are in a different uid namespace.
So you can't assumes uids are the same.  Otherwise you can't handle open
file descriptors or files passed through unix domain sockets.

A user namespace is fundamentally about breaking the assumption that
you can test to see if two users are the same by comparing uids.
If you don't want to break that assumption don't use a user namespace.

Mapping uids is a separate but related problem that we probably need to
tackle in a generic manner, usable for network filesystems.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 18:14             ` Eric W. Biederman
@ 2006-07-13 18:29               ` Dave Hansen
  2006-07-13 19:02                 ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2006-07-13 18:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

On Thu, 2006-07-13 at 12:14 -0600, Eric W. Biederman wrote:
> Maybe.  I really think the sane semantics are in a different uid namespace.
> So you can't assumes uids are the same.  Otherwise you can't handle open
> file descriptors or files passed through unix domain sockets.

Eric, could you explain this a little bit more?  I'm not sure I
understand the details of why this is a problem?

-- Dave


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 18:29               ` Dave Hansen
@ 2006-07-13 19:02                 ` Eric W. Biederman
  2006-07-13 20:03                   ` Dave Hansen
  2006-07-13 21:41                   ` Serge E. Hallyn
  0 siblings, 2 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-13 19:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

Dave Hansen <haveblue@us.ibm.com> writes:

> On Thu, 2006-07-13 at 12:14 -0600, Eric W. Biederman wrote:
>> Maybe.  I really think the sane semantics are in a different uid namespace.
>> So you can't assumes uids are the same.  Otherwise you can't handle open
>> file descriptors or files passed through unix domain sockets.
>
> Eric, could you explain this a little bit more?  I'm not sure I
> understand the details of why this is a problem?

Very simply.

In the presence of a user namespace.  
All comparisons of a user equality need to be of the tuple (user namespace, user id).
Any comparison that does not do that is an optimization.

Because you can have access to files created in another user namespace it
is very unlikely that optimization will apply very frequently.  The easy scenario
to get access to a file descriptor from another context is to consider unix
domain sockets.

So my impression was that Cedric's patchset was overoptimized because
it did not change most of the uid comparisons, to (user namespace, user id).

This is one of those strange cases where the optimization is less work
because it means not applying a patch.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 19:02                 ` Eric W. Biederman
@ 2006-07-13 20:03                   ` Dave Hansen
  2006-07-14  3:45                     ` Eric W. Biederman
  2006-07-13 21:41                   ` Serge E. Hallyn
  1 sibling, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2006-07-13 20:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

On Thu, 2006-07-13 at 13:02 -0600, Eric W. Biederman wrote:
> All comparisons of a user equality need to be of the tuple (user namespace, user id).
> Any comparison that does not do that is an optimization.
...
> So my impression was that Cedric's patchset was overoptimized because
> it did not change most of the uid comparisons, to (user namespace, user id).

I might just be tempted to call them bugs so people understand what I'm
talking about ;)

> Because you can have access to files created in another user namespace it
> is very unlikely that optimization will apply very frequently.  The easy scenario
> to get access to a file descriptor from another context is to consider unix
> domain sockets.

OK, so you're saying that the lack of checks will cause problems rarely,
and that passing a fd across a unix domain sockets is one of the times
when you _could_ encounter this problem?

-- Dave


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 20:03                   ` Dave Hansen
@ 2006-07-14  3:45                     ` Eric W. Biederman
  2006-07-14 14:28                       ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14  3:45 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

Dave Hansen <haveblue@us.ibm.com> writes:

> On Thu, 2006-07-13 at 13:02 -0600, Eric W. Biederman wrote:
>> All comparisons of a user equality need to be of the tuple (user namespace,
> user id).
>> Any comparison that does not do that is an optimization.
> ...
>> So my impression was that Cedric's patchset was overoptimized because
>> it did not change most of the uid comparisons, to (user namespace, user id).
>
> I might just be tempted to call them bugs so people understand what I'm
> talking about ;)
>
>> Because you can have access to files created in another user namespace it
>> is very unlikely that optimization will apply very frequently.  The easy
> scenario
>> to get access to a file descriptor from another context is to consider unix
>> domain sockets.
>
> OK, so you're saying that the lack of checks will cause problems rarely,
> and that passing a fd across a unix domain sockets is one of the times
> when you _could_ encounter this problem?

I think for filesystems like /proc and /sys that there will normally
be problems.  However many of those problems can be rationalized away
as a reasonable optimization, or are not immediately apparent.

Passing a file descriptor between process in a unix domain socket is
a case where I can easily construct scenarios where there are
indisputable problems.  It is one of my standard thought experiments
to see if a namespace is sound.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14  3:45                     ` Eric W. Biederman
@ 2006-07-14 14:28                       ` Dave Hansen
  2006-07-14 15:13                         ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2006-07-14 14:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

On Thu, 2006-07-13 at 21:45 -0600, Eric W. Biederman wrote:
> I think for filesystems like /proc and /sys that there will normally
> be problems.  However many of those problems can be rationalized away
> as a reasonable optimization, or are not immediately apparent.

Could you talk about some of these problems?

> Passing a file descriptor between process in a unix domain socket is
> a case where I can easily construct scenarios where there are
> indisputable problems.  It is one of my standard thought experiments
> to see if a namespace is sound.

Care to share some of those indisputable problems?

-- Dave


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 14:28                       ` Dave Hansen
@ 2006-07-14 15:13                         ` Eric W. Biederman
  2006-07-14 16:29                           ` Serge E. Hallyn
  2006-07-14 16:35                           ` Dave Hansen
  0 siblings, 2 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 15:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

Dave Hansen <haveblue@us.ibm.com> writes:

> On Thu, 2006-07-13 at 21:45 -0600, Eric W. Biederman wrote:
>> I think for filesystems like /proc and /sys that there will normally
>> be problems.  However many of those problems can be rationalized away
>> as a reasonable optimization, or are not immediately apparent.
>
> Could you talk about some of these problems?

Already mentioned but.  rw permissions on sensitive files are for 
uid == 0.  No capability checks are performed.

>> Passing a file descriptor between process in a unix domain socket is
>> a case where I can easily construct scenarios where there are
>> indisputable problems.  It is one of my standard thought experiments
>> to see if a namespace is sound.
>
> Care to share some of those indisputable problems?

Already done.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 15:13                         ` Eric W. Biederman
@ 2006-07-14 16:29                           ` Serge E. Hallyn
  2006-07-14 16:49                             ` Eric W. Biederman
  2006-07-14 16:35                           ` Dave Hansen
  1 sibling, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2006-07-14 16:29 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Hansen, Serge E. Hallyn, Cedric Le Goater, linux-kernel,
	Andrew Morton, Kirill Korotaev, Andrey Savochkin, Herbert Poetzl,
	Sam Vilain

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Dave Hansen <haveblue@us.ibm.com> writes:
> 
> > On Thu, 2006-07-13 at 21:45 -0600, Eric W. Biederman wrote:
> >> I think for filesystems like /proc and /sys that there will normally
> >> be problems.  However many of those problems can be rationalized away
> >> as a reasonable optimization, or are not immediately apparent.
> >
> > Could you talk about some of these problems?
> 
> Already mentioned but.  rw permissions on sensitive files are for 
> uid == 0.  No capability checks are performed.

As Herbert (IIRC) pointed out that could/should be fixed.

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 16:29                           ` Serge E. Hallyn
@ 2006-07-14 16:49                             ` Eric W. Biederman
  2006-07-14 16:55                               ` Dave Hansen
  2006-07-14 17:05                               ` Serge E. Hallyn
  0 siblings, 2 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 16:49 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Dave Hansen, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Dave Hansen <haveblue@us.ibm.com> writes:
>> 
>> > On Thu, 2006-07-13 at 21:45 -0600, Eric W. Biederman wrote:
>> >> I think for filesystems like /proc and /sys that there will normally
>> >> be problems.  However many of those problems can be rationalized away
>> >> as a reasonable optimization, or are not immediately apparent.
>> >
>> > Could you talk about some of these problems?
>> 
>> Already mentioned but.  rw permissions on sensitive files are for 
>> uid == 0.  No capability checks are performed.
>
> As Herbert (IIRC) pointed out that could/should be fixed.

Capabilities have always fitted badly in with the normal unix
permissions. So if we have a solution that works nicely with normal
unix permissions we will have a nice general solution, that is
easy for people to understand.

What I am talking about is making a small tweak to the permission
checking as below.  Why do you keep avoiding even considering it?

Eric

  int generic_permission(struct inode *inode, int mask,
  		int (*check_acl)(struct inode *inode, int mask))
  {
  	umode_t			mode = inode->i_mode;
  
- 	if (current->fsuid == inode->i_uid)
+ 	if ((current->fsuid == inode->i_uid) &&
+ 		(current->nsproxy->user_ns == inode->i_sb->user_ns))
  		mode >>= 6;
  	else {
  		if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) {
  			int error = check_acl(inode, mask);
  			if (error == -EACCES)
  				goto check_capabilities;
  			else if (error != -EAGAIN)
  				return error;
  		}
  
- 		if (in_group_p(inode->i_gid))
+ 		if (in_group_p(inode->i_sb->user_ns, inode->i_gid))
  			mode >>= 3;
  	}
  
  	/*
  	 * If the DACs are ok we don't need any capability check.
  	 */
  	if (((mode & mask & (MAY_READ|MAY_WRITE|MAY_EXEC)) == mask))
  		return 0;
  
   check_capabilities:
  	/*
  	 * Read/write DACs are always overridable.
  	 * Executable DACs are overridable if at least one exec bit is set.
  	 */
  	if (!(mask & MAY_EXEC) ||
  	    (inode->i_mode & S_IXUGO) || S_ISDIR(inode->i_mode))
  		if (capable(CAP_DAC_OVERRIDE))
  			return 0;
  
  	/*
  	 * Searching includes executable on directories, else just read.
  	 */
  	if (mask == MAY_READ || (S_ISDIR(inode->i_mode) && !(mask & MAY_WRITE)))
  		if (capable(CAP_DAC_READ_SEARCH))
  			return 0;
  
  	return -EACCES;
  }


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 16:49                             ` Eric W. Biederman
@ 2006-07-14 16:55                               ` Dave Hansen
  2006-07-14 17:08                                 ` Serge E. Hallyn
                                                   ` (2 more replies)
  2006-07-14 17:05                               ` Serge E. Hallyn
  1 sibling, 3 replies; 107+ messages in thread
From: Dave Hansen @ 2006-07-14 16:55 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

On Fri, 2006-07-14 at 10:49 -0600, Eric W. Biederman wrote:
> -       if (current->fsuid == inode->i_uid)
> +       if ((current->fsuid == inode->i_uid) &&
> +               (current->nsproxy->user_ns == inode->i_sb->user_ns))
>                 mode >>= 6; 

I really don't think assigning a user namespace to a superblock is the
right way to go.  Seems to me like the _view_ into the filesystem is
what you want to modify.  That would seem to me to mean that each
'struct namespace' (filesystem namespace) or vfsmount would be assigned
a corresponding user namespace, *not* the superblock.

-- Dave


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 16:55                               ` Dave Hansen
@ 2006-07-14 17:08                                 ` Serge E. Hallyn
  2006-07-14 17:19                                   ` Dave Hansen
  2006-07-14 17:14                                 ` Eric W. Biederman
  2006-07-16  8:36                                 ` Kirill Korotaev
  2 siblings, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2006-07-14 17:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Eric W. Biederman, Serge E. Hallyn, Cedric Le Goater,
	linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain

Quoting Dave Hansen (haveblue@us.ibm.com):
> On Fri, 2006-07-14 at 10:49 -0600, Eric W. Biederman wrote:
> > -       if (current->fsuid == inode->i_uid)
> > +       if ((current->fsuid == inode->i_uid) &&
> > +               (current->nsproxy->user_ns == inode->i_sb->user_ns))
> >                 mode >>= 6; 
> 
> I really don't think assigning a user namespace to a superblock is the
> right way to go.  Seems to me like the _view_ into the filesystem is
> what you want to modify.  That would seem to me to mean that each
> 'struct namespace' (filesystem namespace) or vfsmount would be assigned
> a corresponding user namespace, *not* the superblock.

yes, of course, vfsmount, which I assume is what Eric meant?

Which means we'd have to do this at permission() using the nameidata, or
pass nd to generic_permission.

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 17:08                                 ` Serge E. Hallyn
@ 2006-07-14 17:19                                   ` Dave Hansen
  2006-07-14 17:36                                     ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2006-07-14 17:19 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric W. Biederman, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

On Fri, 2006-07-14 at 12:08 -0500, Serge E. Hallyn wrote:
> yes, of course, vfsmount, which I assume is what Eric meant?
> 
> Which means we'd have to do this at permission() using the nameidata, or
> pass nd to generic_permission. 

Yeah, I think so.  But, this is well into Al territory, and there might
be a better way.

-- Dave


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 17:19                                   ` Dave Hansen
@ 2006-07-14 17:36                                     ` Eric W. Biederman
  2006-07-14 18:15                                       ` Trond Myklebust
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 17:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

Dave Hansen <haveblue@us.ibm.com> writes:

> On Fri, 2006-07-14 at 12:08 -0500, Serge E. Hallyn wrote:
>> yes, of course, vfsmount, which I assume is what Eric meant?
>> 
>> Which means we'd have to do this at permission() using the nameidata, or
>> pass nd to generic_permission. 
>
> Yeah, I think so.  But, this is well into Al territory, and there might
> be a better way.

Well until we get that sorted out I will keep picking on i_sb.

The one thing that just occurred to me is that the generic permission
should end up structured as below.  I don't know if it is cheaper
but it is certainly more obvious.

Eric


  int generic_permission(struct inode *inode, int mask,
  		int (*check_acl)(struct inode *inode, int mask))
  {
  	umode_t			mode = inode->i_mode;
  
+       if (unlikely(current->nsproxy->user_ns != inode->i_sb->user_ns))
+          	goto unknown_user;

  	if (current->fsuid == inode->i_uid)
  		mode >>= 6;
  	else {
  		if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) {
  			int error = check_acl(inode, mask);
  			if (error == -EACCES)
  				goto check_capabilities;
  			else if (error != -EAGAIN)
  				return error;
  		}
  
  		if (in_group_p(inode->i_gid))
  			mode >>= 3;
  	}
+ unknown_user:
  
  	/*
  	 * If the DACs are ok we don't need any capability check.
  	 */
  	if (((mode & mask & (MAY_READ|MAY_WRITE|MAY_EXEC)) == mask))
  		return 0;
  
   check_capabilities:
  	/*
  	 * Read/write DACs are always overridable.
  	 * Executable DACs are overridable if at least one exec bit is set.
  	 */
  	if (!(mask & MAY_EXEC) ||
  	    (inode->i_mode & S_IXUGO) || S_ISDIR(inode->i_mode))
  		if (capable(CAP_DAC_OVERRIDE))
  			return 0;
  
  	/*
  	 * Searching includes executable on directories, else just read.
  	 */
  	if (mask == MAY_READ || (S_ISDIR(inode->i_mode) && !(mask & MAY_WRITE)))
  		if (capable(CAP_DAC_READ_SEARCH))
  			return 0;
  
  	return -EACCES;
  }


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 17:36                                     ` Eric W. Biederman
@ 2006-07-14 18:15                                       ` Trond Myklebust
  2006-07-14 18:40                                         ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Trond Myklebust @ 2006-07-14 18:15 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Hansen, Serge E. Hallyn, Cedric Le Goater, linux-kernel,
	Andrew Morton, Kirill Korotaev, Andrey Savochkin, Herbert Poetzl,
	Sam Vilain

On Fri, 2006-07-14 at 11:36 -0600, Eric W. Biederman wrote:
> Dave Hansen <haveblue@us.ibm.com> writes:
> 
> > On Fri, 2006-07-14 at 12:08 -0500, Serge E. Hallyn wrote:
> >> yes, of course, vfsmount, which I assume is what Eric meant?
> >> 
> >> Which means we'd have to do this at permission() using the nameidata, or
> >> pass nd to generic_permission. 
> >
> > Yeah, I think so.  But, this is well into Al territory, and there might
> > be a better way.
> 
> Well until we get that sorted out I will keep picking on i_sb.

Don't bother: labelling superblocks with process-specific data is always
going to be unacceptable.

In order to avoid aliased superblocks, you would have to be able
guarantee to be the sole owner of the data on the device that it refers
to. You'd have to own the device in order to do that, in which case you
are better off just labelling the device instead.

Cheers,
  Trond


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 18:15                                       ` Trond Myklebust
@ 2006-07-14 18:40                                         ` Eric W. Biederman
  2006-07-14 21:04                                           ` Trond Myklebust
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 18:40 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Dave Hansen, Serge E. Hallyn, Cedric Le Goater, linux-kernel,
	Andrew Morton, Kirill Korotaev, Andrey Savochkin, Herbert Poetzl,
	Sam Vilain

Trond Myklebust <trond.myklebust@fys.uio.no> writes:

> On Fri, 2006-07-14 at 11:36 -0600, Eric W. Biederman wrote:
>> Dave Hansen <haveblue@us.ibm.com> writes:
>> 
>> > On Fri, 2006-07-14 at 12:08 -0500, Serge E. Hallyn wrote:
>> >> yes, of course, vfsmount, which I assume is what Eric meant?
>> >> 
>> >> Which means we'd have to do this at permission() using the nameidata, or
>> >> pass nd to generic_permission. 
>> >
>> > Yeah, I think so.  But, this is well into Al territory, and there might
>> > be a better way.
>> 
>> Well until we get that sorted out I will keep picking on i_sb.
>
> Don't bother: labelling superblocks with process-specific data is always
> going to be unacceptable.

It's not process specific data.  It is a pointer to global context in
which uid's on the filesystem uniquely specify a user.  This is
something that would get set when the filesystem is mounted.

> In order to avoid aliased superblocks, you would have to be able
> guarantee to be the sole owner of the data on the device that it refers
> to. You'd have to own the device in order to do that, in which case you
> are better off just labelling the device instead.

Now I do agree if I can set the information in vfsmount and not in
the superblock it is probably better.  But even with nfs mount superblock
collapsing (which I almost understand) I don't see it as a real
problem, as long as I could prevent the superblock from collapsing.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 18:40                                         ` Eric W. Biederman
@ 2006-07-14 21:04                                           ` Trond Myklebust
  2006-07-15  4:09                                             ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Trond Myklebust @ 2006-07-14 21:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Hansen, Serge E. Hallyn, Cedric Le Goater, linux-kernel,
	Andrew Morton, Kirill Korotaev, Andrey Savochkin, Herbert Poetzl,
	Sam Vilain

On Fri, 2006-07-14 at 12:40 -0600, Eric W. Biederman wrote:
> Now I do agree if I can set the information in vfsmount and not in
> the superblock it is probably better.  But even with nfs mount superblock
> collapsing (which I almost understand) I don't see it as a real
> problem, as long as I could prevent the superblock from collapsing.

NFS is the least of your problems. You can only have one superblock for
most local filesystems too and with good reason: imagine, for instance,
the effect of having 2 different block allocators working on the same
device.

Cheers
  Trond


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 21:04                                           ` Trond Myklebust
@ 2006-07-15  4:09                                             ` Eric W. Biederman
  2006-07-15  4:35                                               ` Kyle Moffett
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-15  4:09 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Dave Hansen, Serge E. Hallyn, Cedric Le Goater, linux-kernel,
	Andrew Morton, Kirill Korotaev, Andrey Savochkin, Herbert Poetzl,
	Sam Vilain

Trond Myklebust <trond.myklebust@fys.uio.no> writes:

> On Fri, 2006-07-14 at 12:40 -0600, Eric W. Biederman wrote:
>> Now I do agree if I can set the information in vfsmount and not in
>> the superblock it is probably better.  But even with nfs mount superblock
>> collapsing (which I almost understand) I don't see it as a real
>> problem, as long as I could prevent the superblock from collapsing.
>
> NFS is the least of your problems. You can only have one superblock for
> most local filesystems too and with good reason: imagine, for instance,
> the effect of having 2 different block allocators working on the same
> device.

Let me try to explain the idea again.

Currently there is a global context in which we interpret uids.
But different machines can have different global contexts.

Each filesystem to be sane needs to store uids from only one such
context.  For network filesystems typicall the context is extended to
multiple machines so that everyone who mounts a filesystem will
interpret a uid with the same meaning.

The idea of creating multiple a user id namespaces on a single machine
creates multiple contexts for the interpretation of uid values on
the same machine.  Allowing a single id to refer to different
users depending on the context in which it is interpreted.

I can think of no circumstance in which a single filesystem
will have multiple contexts in which user id's will be interpreted.
Nor can I think of a sane scenario in which that would occur.

Given the fact that we are referring to a global property of a filesystem
why is it fundamentally a problem to put it in the superblock?

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-15  4:09                                             ` Eric W. Biederman
@ 2006-07-15  4:35                                               ` Kyle Moffett
  2006-07-15 12:35                                                 ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Kyle Moffett @ 2006-07-15  4:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Trond Myklebust, Dave Hansen, Serge E. Hallyn, Cedric Le Goater,
	linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain

On Jul 15, 2006, at 00:09:50, Eric W. Biederman wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> writes:
>> NFS is the least of your problems. You can only have one  
>> superblock for most local filesystems too and with good reason:  
>> imagine, for instance, the effect of having 2 different block  
>> allocators working on the same device.
>
> Let me try to explain the idea again.
>
> Currently there is a global context in which we interpret uids.   
> But different machines can have different global contexts.
>
> Each filesystem to be sane needs to store uids from only one such  
> context.  For network filesystems typicall the context is extended  
> to multiple machines so that everyone who mounts a filesystem will  
> interpret a uid with the same meaning.
>
> The idea of creating multiple a user id namespaces on a single  
> machine creates multiple contexts for the interpretation of uid  
> values on the same machine.  Allowing a single id to refer to  
> different users depending on the context in which it is interpreted.
>
> I can think of no circumstance in which a single filesystem will  
> have multiple contexts in which user id's will be interpreted. Nor  
> can I think of a sane scenario in which that would occur.
>
> Given the fact that we are referring to a global property of a  
> filesystem why is it fundamentally a problem to put it in the  
> superblock?

Here's a possible example:

I have one disk which I want to share between multiple virtualized  
instances for root filesystems.  I bind-mount /onedisk/foo as the foo  
virtual machine's root and /onedisk/bar as the bar virtual machine's  
root.  There should (must) be two interpretations of the linear UID  
space on that disk, one for the foo virtual machine, and one for the  
bar virtual machine.  By allowing the administrator to determine UID  
namespace per-vfsmount, you make such an arrangement possible where  
it otherwise would not be.

With NFS and the proposed superblock-sharing patches (necessary for  
efficiency and other reasons I don't entirely understand), the  
situation is worse:  A mount of server:/foo/bar on / in the bar  
virtual machine may get its superblock merged with a mount of server:/ 
foo/baz on / in the baz virtual machine.  If it's efficient to merge  
those superblocks we should, and once again it's necessary to tie the  
UID namespace to the vfsmount, not the superblock.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-15  4:35                                               ` Kyle Moffett
@ 2006-07-15 12:35                                                 ` Eric W. Biederman
  2006-07-15 13:25                                                   ` Kyle Moffett
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-15 12:35 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Trond Myklebust, Dave Hansen, Serge E. Hallyn, Cedric Le Goater,
	linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain

Kyle Moffett <mrmacman_g4@mac.com> writes:

> Here's a possible example:
>
> I have one disk which I want to share between multiple virtualized  instances
> for root filesystems.  I bind-mount /onedisk/foo as the foo  virtual machine's
> root and /onedisk/bar as the bar virtual machine's  root.  There should (must)
> be two interpretations of the linear UID  space on that disk, one for the foo
> virtual machine, and one for the  bar virtual machine.  By allowing the
> administrator to determine UID  namespace per-vfsmount, you make such an
> arrangement possible where  it otherwise would not be.

Yes.

With the scenario you describe there is a confusing case of how do
you interpret uids on the /onedisk mount.  uid mapping may be a more
appropriate strategy to remove all confusion there.

> With NFS and the proposed superblock-sharing patches (necessary for  efficiency
> and other reasons I don't entirely understand), the  situation is worse:  A
> mount of server:/foo/bar on / in the bar virtual machine may get its superblock
> merged with a mount of server:/ foo/baz on / in the baz virtual machine.  If
> it's efficient to merge  those superblocks we should, and once again it's
> necessary to tie the  UID namespace to the vfsmount, not the
> superblock.

I completely agree that pushing nameidata down into generic_permission
where we can use per mount properties in our permission checks is
ideal.  The benefit I see is just a small increase in flexibility.
So I don't really care either way.

Currently there are several additional flags that could benefit
from a per vfsmount interpretation as well:  nosuid, noexec, nodev,
and readonly, how do we handle those?

noexec is on the vfsmount.
nosuid is on the vfsmount
nodev  is on the vfsmount
readonly is not on the vfsmount.

The existing precedent is clearly in favor of putting this kind of
information on the vfsmount.  The read-only attribute seems to
be the only hold out.  If readonly has deep implications like
no journal replay it makes sense to keep it per mount.  Which
indicates we could nose a nowrite option to express the per
vfsmount property.

I hope the confusion has passed for Trond.  My impression was he
figured this was per process data so it didn't make sense any where
near a filesystem, and the superblock was the last place it should
be.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-15 12:35                                                 ` Eric W. Biederman
@ 2006-07-15 13:25                                                   ` Kyle Moffett
  2006-07-15 15:54                                                   ` Dave Hansen
  2006-07-15 17:01                                                   ` Trond Myklebust
  2 siblings, 0 replies; 107+ messages in thread
From: Kyle Moffett @ 2006-07-15 13:25 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Trond Myklebust, Dave Hansen, Serge E. Hallyn, Cedric Le Goater,
	linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain

On Jul 15, 2006, at 08:35:18, Eric W. Biederman wrote:
> Kyle Moffett <mrmacman_g4@mac.com> writes:
>> With NFS and the proposed superblock-sharing patches (necessary  
>> for  efficiency and other reasons I don't entirely understand),  
>> the  situation is worse:  A mount of server:/foo/bar on / in the  
>> bar virtual machine may get its superblock merged with a mount of  
>> server:/ foo/baz on / in the baz virtual machine.  If it's  
>> efficient to merge  those superblocks we should, and once again  
>> it's necessary to tie the  UID namespace to the vfsmount, not the  
>> superblock.
>
> I completely agree that pushing nameidata down into  
> generic_permission where we can use per mount properties in our  
> permission checks is ideal.  The benefit I see is just a small  
> increase in flexibility. So I don't really care either way.
>
> Currently there are several additional flags that could benefit  
> from a per vfsmount interpretation as well:  nosuid, noexec, nodev,  
> and readonly, how do we handle those?
>
> noexec is on the vfsmount.
> nosuid is on the vfsmount
> nodev  is on the vfsmount
> readonly is not on the vfsmount.
>
> The existing precedent is clearly in favor of putting this kind of  
> information on the vfsmount.  The read-only attribute seems to be  
> the only hold out.  If readonly has deep implications like no  
> journal replay it makes sense to keep it per mount.  Which  
> indicates we could nose a nowrite option to express the per  
> vfsmount property.

Well, speaking of that; there's been another thread recently that's  
splitting the properties of read-only between vfsmount and  
superblock.  So a read-only superblock implies read-only vfsmounts,  
but the following can create a read-only vfsmount for a writable  
superblock:

   mount --bind / /mnt/read-only-root
   mount -o ro,remount /mnt/read-only-root

So the readonly special case will go away.

> I hope the confusion has passed for Trond.  My impression was he  
> figured this was per process data so it didn't make sense any where  
> near a filesystem, and the superblock was the last place it should be.

One of the things I said earlier in this thread is that "Both  
filesystems _and_ processes should be first-class objects in any UID  
namespace".  In order to have sufficient access controls in the  
presence of _only_ a UID-namespace (as opposed to with full container  
isolation), you need to check against an object *and* the namespace  
in which it is located.  In some cases, the object is a file, which  
means that the inode, vfsmount, or superblock need a UID namespace  
reference.  Theoretically a you could implement per-file UID  
namespace pointers, but that would probably be incredibly  
inefficient.  IMHO, per-vfsmount gives the best flexibility and  
efficiency of the three.

In fact, it's strange to think about this in context with the rest of  
the namespaces that are being designed, but processes would  
ordinarily *not* have primary presence in a UID namespace if they  
weren't the target of UID-verified operations in and of themselves  
(EX: kill, ptrace, etc).  Otherwise they would just have a set of  
(namespace,UID,cap_flags) pairs to give them access to filesystems in  
specific uid namespaces.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-15 12:35                                                 ` Eric W. Biederman
  2006-07-15 13:25                                                   ` Kyle Moffett
@ 2006-07-15 15:54                                                   ` Dave Hansen
  2006-07-15 17:01                                                   ` Trond Myklebust
  2 siblings, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2006-07-15 15:54 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kyle Moffett, Trond Myklebust, Serge E. Hallyn, Cedric Le Goater,
	linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain

On Sat, 2006-07-15 at 06:35 -0600, Eric W. Biederman wrote:
> Currently there are several additional flags that could benefit
> from a per vfsmount interpretation as well:  nosuid, noexec, nodev,
> and readonly, how do we handle those?
> 
> noexec is on the vfsmount.
> nosuid is on the vfsmount
> nodev  is on the vfsmount
> readonly is not on the vfsmount.

I can help with one of them:

http://www.opensubscriber.com/message/linux-kernel@vger.kernel.org/4437187.html

A rollup is here (it includes other things, though)

http://www.sr71.net/patches/2.6.18/2.6.18-rc1-mm1-lxc4/

-- Dave


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-15 12:35                                                 ` Eric W. Biederman
  2006-07-15 13:25                                                   ` Kyle Moffett
  2006-07-15 15:54                                                   ` Dave Hansen
@ 2006-07-15 17:01                                                   ` Trond Myklebust
  2006-07-15 23:29                                                     ` Eric W. Biederman
  2 siblings, 1 reply; 107+ messages in thread
From: Trond Myklebust @ 2006-07-15 17:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kyle Moffett, Dave Hansen, Serge E. Hallyn, Cedric Le Goater,
	linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain

On Sat, 2006-07-15 at 06:35 -0600, Eric W. Biederman wrote:
> I hope the confusion has passed for Trond.  My impression was he
> figured this was per process data so it didn't make sense any where
> near a filesystem, and the superblock was the last place it should
> be.

You are still using the wrong abstraction. Data that is not global to
the entire machine has absolutely _no_ place being put into the
superblock. It doesn't matter if it is process-specific,
container-specific or whatever-else-specific, it will still be vetoed.

If your real problem is uid/gid mapping on top of generic filesystems,
then have you looked into the *BSD solution of using a stackable
filesystem (i.e. umapfs)?

  Trond


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-15 17:01                                                   ` Trond Myklebust
@ 2006-07-15 23:29                                                     ` Eric W. Biederman
  2006-07-16 16:18                                                       ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-15 23:29 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Kyle Moffett, Dave Hansen, Serge E. Hallyn, Cedric Le Goater,
	linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain

Trond Myklebust <trond.myklebust@fys.uio.no> writes:

> On Sat, 2006-07-15 at 06:35 -0600, Eric W. Biederman wrote:
>> I hope the confusion has passed for Trond.  My impression was he
>> figured this was per process data so it didn't make sense any where
>> near a filesystem, and the superblock was the last place it should
>> be.
>
> You are still using the wrong abstraction. Data that is not global to
> the entire machine has absolutely _no_ place being put into the
> superblock. It doesn't matter if it is process-specific,
> container-specific or whatever-else-specific, it will still be vetoed.

Sure, I have no problem with only global data, and filesystem specific
data being in a super block.  In this case my impression is that the
data is at least arguably filesystem specific.  filesystem-specific
data is ok on the super block is it not?

> If your real problem is uid/gid mapping on top of generic filesystems,
> then have you looked into the *BSD solution of using a stackable
> filesystem (i.e. umapfs)?

I haven't and it sounds reasonable to look at.  As far as I know BSDs
don't have my specific problem.  uid mapping is simply a tool for
dealing with the problem, not the problem itself.  A stackable
filesystem is a reasonable alternative to using security keys to
do the mapping.

My real problem is that there is a good case for uids that are not
global to a machine.  The discussion is simply how do you cope with
that.

Now I do believe that there is a good case for uids being global to a
filesystem and all I was really talking about was a tag that marked
which parts of the entire system that used the same uids as the
filesystem.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-15 23:29                                                     ` Eric W. Biederman
@ 2006-07-16 16:18                                                       ` Dave Hansen
  0 siblings, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2006-07-16 16:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Trond Myklebust, Kyle Moffett, Serge E. Hallyn, Cedric Le Goater,
	linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain

On Sat, 2006-07-15 at 17:29 -0600, Eric W. Biederman wrote:
> Trond Myklebust <trond.myklebust@fys.uio.no> writes:
> > You are still using the wrong abstraction. Data that is not global to
> > the entire machine has absolutely _no_ place being put into the
> > superblock. It doesn't matter if it is process-specific,
> > container-specific or whatever-else-specific, it will still be vetoed.
> 
> Sure, I have no problem with only global data, and filesystem specific
> data being in a super block.  In this case my impression is that the
> data is at least arguably filesystem specific.  filesystem-specific
> data is ok on the super block is it not?

Yes, it is OK if you can win that argument. ;)  I'll let you take that
one up with Al.  You're just saying that since there is a possibility of
an argument, it should be OK to put the data in the sb?  

I think the uid mapping properties are OK to keep in the superblock if
and only if there is only one possible way to interpret the uids on the
disk into the uids that are seen in userspace.

This is the same argument for putting things like mnt_root in the
vfsmount.  Is there only one possible place in the filesystem in which a
given superblock may be mounted?  No.  So, it goes in the vfsmount.

I personally trust people like Trond's judgment on this stuff like this.
He's been messing with the VFS much longer than I.  Eric, would you like
to start another thread on this topic, and CC the main people from this
thread, and include Al?

Or, is this something we should address briefly at the kernel summit?

-- Dave

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 16:55                               ` Dave Hansen
  2006-07-14 17:08                                 ` Serge E. Hallyn
@ 2006-07-14 17:14                                 ` Eric W. Biederman
  2006-07-16  8:36                                 ` Kirill Korotaev
  2 siblings, 0 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 17:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

Dave Hansen <haveblue@us.ibm.com> writes:

> On Fri, 2006-07-14 at 10:49 -0600, Eric W. Biederman wrote:
>> -       if (current->fsuid == inode->i_uid)
>> +       if ((current->fsuid == inode->i_uid) &&
>> +               (current->nsproxy->user_ns == inode->i_sb->user_ns))
>>                 mode >>= 6; 
>
> I really don't think assigning a user namespace to a superblock is the
> right way to go.  Seems to me like the _view_ into the filesystem is
> what you want to modify.  That would seem to me to mean that each
> 'struct namespace' (filesystem namespace) or vfsmount would be assigned
> a corresponding user namespace, *not* the superblock.

I guess since you can't bind mount across filesystem namespaces looking
at the vfsmount is ok, and more flexible.

But inode->i_sb->user_ns or nd->mnt->user_ns isn't nearly as important
as simply comparing some the appropriate user_ns values.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 16:55                               ` Dave Hansen
  2006-07-14 17:08                                 ` Serge E. Hallyn
  2006-07-14 17:14                                 ` Eric W. Biederman
@ 2006-07-16  8:36                                 ` Kirill Korotaev
  2006-07-16 10:08                                   ` Eric W. Biederman
  2 siblings, 1 reply; 107+ messages in thread
From: Kirill Korotaev @ 2006-07-16  8:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Eric W. Biederman, Serge E. Hallyn, Cedric Le Goater,
	linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain

>>-       if (current->fsuid == inode->i_uid)
>>+       if ((current->fsuid == inode->i_uid) &&
>>+               (current->nsproxy->user_ns == inode->i_sb->user_ns))
>>                mode >>= 6; 
> 
> 
> I really don't think assigning a user namespace to a superblock is the
> right way to go.  Seems to me like the _view_ into the filesystem is
> what you want to modify.  That would seem to me to mean that each
> 'struct namespace' (filesystem namespace) or vfsmount would be assigned
> a corresponding user namespace, *not* the superblock.
I dislike this way either. We need an ability to have an access to container
filesystems and data from the host.
such a strong checks break this.

Thanks.
Kirill

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-16  8:36                                 ` Kirill Korotaev
@ 2006-07-16 10:08                                   ` Eric W. Biederman
  0 siblings, 0 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-16 10:08 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Dave Hansen, Serge E. Hallyn, Cedric Le Goater, linux-kernel,
	Andrew Morton, Kirill Korotaev, Andrey Savochkin, Herbert Poetzl,
	Sam Vilain

Kirill Korotaev <dev@sw.ru> writes:

>>>-       if (current->fsuid == inode->i_uid)
>>>+       if ((current->fsuid == inode->i_uid) &&
>>>+               (current->nsproxy->user_ns == inode->i_sb->user_ns))
>>>                mode >>= 6;
>> I really don't think assigning a user namespace to a superblock is the
>> right way to go.  Seems to me like the _view_ into the filesystem is
>> what you want to modify.  That would seem to me to mean that each
>> 'struct namespace' (filesystem namespace) or vfsmount would be assigned
>> a corresponding user namespace, *not* the superblock.
> I dislike this way either. We need an ability to have an access to container
> filesystems and data from the host.
> such a strong checks break this.

No this check doesn't.  CAP_DAC_OVERRIDE still works.

Of course enter a running container is a subject for major debate.

Eric


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 16:49                             ` Eric W. Biederman
  2006-07-14 16:55                               ` Dave Hansen
@ 2006-07-14 17:05                               ` Serge E. Hallyn
  2006-07-14 17:50                                 ` Kyle Moffett
  2006-07-14 17:56                                 ` Eric W. Biederman
  1 sibling, 2 replies; 107+ messages in thread
From: Serge E. Hallyn @ 2006-07-14 17:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Dave Hansen, Cedric Le Goater, linux-kernel,
	Andrew Morton, Kirill Korotaev, Andrey Savochkin, Herbert Poetzl,
	Sam Vilain

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> Dave Hansen <haveblue@us.ibm.com> writes:
> >> 
> >> > On Thu, 2006-07-13 at 21:45 -0600, Eric W. Biederman wrote:
> >> >> I think for filesystems like /proc and /sys that there will normally
> >> >> be problems.  However many of those problems can be rationalized away
> >> >> as a reasonable optimization, or are not immediately apparent.
> >> >
> >> > Could you talk about some of these problems?
> >> 
> >> Already mentioned but.  rw permissions on sensitive files are for 
> >> uid == 0.  No capability checks are performed.
> >
> > As Herbert (IIRC) pointed out that could/should be fixed.
> 
> Capabilities have always fitted badly in with the normal unix
> permissions.

Well they're not supposed to fit in.

If we keep permchecks as uid==0 on files which invoke kernel callbacks,
then we can only say once what root is allowed to do.  If we make them
capability checks, then for differnet uses of namespaces we can have
them do different things.  For instance if we're making a separate user
namespace for a checkpoint/restart purpose, we might want root to retain
more privs than if we're making a vserver.

Look I just have to keep responding because you keep provoking :), but
I'm looking at other code and don't even know which entries we're
talking about.  If when I get to looking at them I find they really
should be done by capabilities, I'll submit a patch and we can argue
then.

> So if we have a solution that works nicely with normal
> unix permissions we will have a nice general solution, that is
> easy for people to understand.
> 
> What I am talking about is making a small tweak to the permission
> checking as below.  Why do you keep avoiding even considering it?

Not only don't I avoid considering it, I thought I'd even suggested it
a while ago  :)

It sounds good to me.

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 17:05                               ` Serge E. Hallyn
@ 2006-07-14 17:50                                 ` Kyle Moffett
  2006-07-15 11:33                                   ` Serge E. Hallyn
  2006-07-14 17:56                                 ` Eric W. Biederman
  1 sibling, 1 reply; 107+ messages in thread
From: Kyle Moffett @ 2006-07-14 17:50 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric W. Biederman, Dave Hansen, Cedric Le Goater, linux-kernel,
	Andrew Morton, Kirill Korotaev, Andrey Savochkin, Herbert Poetzl,
	Sam Vilain

On Jul 14, 2006, at 13:05:23, Serge E. Hallyn wrote:
> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Capabilities have always fitted badly in with the normal unix  
>> permissions.
>
> Well they're not supposed to fit in.
>
> If we keep permchecks as uid==0 on files which invoke kernel  
> callbacks, then we can only say once what root is allowed to do.   
> If we make them capability checks, then for differnet uses of  
> namespaces we can have them do different things.  For instance if  
> we're making a separate user namespace for a checkpoint/restart  
> purpose, we might want root to retain more privs than if we're  
> making a vserver.
>
> Look I just have to keep responding because you keep provoking :),  
> but I'm looking at other code and don't even know which entries  
> we're talking about.  If when I get to looking at them I find they  
> really should be done by capabilities, I'll submit a patch and we  
> can argue then.

Capabilities are not a single fundamental privilege set and they  
don't interact at all nicely with virtualization.  If we're going to  
do this properly we need to divide capabilities up into different  
sets applied to different objects.  For example, the CAP_DAC_OVERRIDE  
capability should _really_ be per (PID,vfsmount) pair, although that  
would probably be exceptionally inefficient, so as an optimization we  
could make it per (PID,target_uid_ns) pair.  That maps very  
conveniently to the kernel keyring system, where we can make a "uid"  
keytype indexed by target_uid_ns and containing three things:  A  
process-manipulation UID (think euid), a file-manipulation UID (think  
fsuid), and a set of per-uid-ns capabilities.  Similar mappings  
should be set up for most of the other capabilities.  Normal cap_set  
and cap_get and capable() calls on those particular capabilities  
should look up and modify the "uid" key associated with the current- 
 >nsproxy->uidns namespace, and extra calls should be added to modify  
capabilities with respect to specific UID namespaces.

Here's a list of capabilities that should probably be in each "uid" key:
CAP_CHOWN
CAP_DAC_OVERRIDE
CAP_DAC_READ_SEARCH
CAP_FOWNER
CAP_FSETID
CAP_FS_MASK
CAP_SETGID
CAP_SETUID
CAP_LINUX_IMMUTABLE
CAP_IPC_OWNER

Cheers,
Kyle Moffett




^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 17:50                                 ` Kyle Moffett
@ 2006-07-15 11:33                                   ` Serge E. Hallyn
  0 siblings, 0 replies; 107+ messages in thread
From: Serge E. Hallyn @ 2006-07-15 11:33 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Serge E. Hallyn, Eric W. Biederman, Dave Hansen, Cedric Le Goater,
	linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain

Quoting Kyle Moffett (mrmacman_g4@mac.com):
> On Jul 14, 2006, at 13:05:23, Serge E. Hallyn wrote:
> >Quoting Eric W. Biederman (ebiederm@xmission.com):
> >>Capabilities have always fitted badly in with the normal unix  
> >>permissions.
> >
> >Well they're not supposed to fit in.
> >
> >If we keep permchecks as uid==0 on files which invoke kernel  
> >callbacks, then we can only say once what root is allowed to do.   
> >If we make them capability checks, then for differnet uses of  
> >namespaces we can have them do different things.  For instance if  
> >we're making a separate user namespace for a checkpoint/restart  
> >purpose, we might want root to retain more privs than if we're  
> >making a vserver.
> >
> >Look I just have to keep responding because you keep provoking :),  
> >but I'm looking at other code and don't even know which entries  
> >we're talking about.  If when I get to looking at them I find they  
> >really should be done by capabilities, I'll submit a patch and we  
> >can argue then.
> 
> Capabilities are not a single fundamental privilege set and they  

But they are exactly that.

> don't interact at all nicely with virtualization.  If we're going to  

There are three ways caps can deal with virtualization.

First, we can stick with the idea that a child process inherits
capabilities from the parent.  In that case, while starting up a child
container, we give the init process the caps we want it to have (i.e.
perhaps we just don't give it CAP_NET_ADMIN).  It can then dole those
out as usual, and it's children then can't regain the others.  Problem
is setuid binaries are a special case.  Though if we implement fs caps,
we could then put a max_caps limit on the fs caps for a vfsmount.

Second, we can actually create a separate capabilities namespace  :)
Sounds silly, but when you consider that you're recommending passing
several namespaces into capable() to have some magic calculation
performed on them, maybe it's not so bad.  This way, an suid binary
ends up with just the capabilities specified in the namespace.

Third is to do what you suggest below.  Fortunately that's easy enough
to experiment with, since we can just define a new capability_virt LSM.
I don't think any of the capable calls need to be extended, since we can
grab the namespace pointers needed from the task_struct->nsproxy's.

IMO, capabilities are already complicated past the point of danger, and
the solution keeping things simplest is the second.  Since capabilities
are an LSM, the namespace creation and configuration would likely be
done differently from other namespace, i.e. through a securityfs
interface.

This is of course something which could fit under the security/LSM
portion of the containers topic at ksummit...  Along with the uid/keys
discussion.

> do this properly we need to divide capabilities up into different  
> sets applied to different objects.  For example, the CAP_DAC_OVERRIDE  
> capability should _really_ be per (PID,vfsmount) pair, although that  
> would probably be exceptionally inefficient, so as an optimization we  
> could make it per (PID,target_uid_ns) pair.  That maps very  
> conveniently to the kernel keyring system, where we can make a "uid"  
> keytype indexed by target_uid_ns and containing three things:  A  
> process-manipulation UID (think euid), a file-manipulation UID (think  
> fsuid), and a set of per-uid-ns capabilities.  Similar mappings  
> should be set up for most of the other capabilities.  Normal cap_set  
> and cap_get and capable() calls on those particular capabilities  
> should look up and modify the "uid" key associated with the current- 
> >nsproxy->uidns namespace, and extra calls should be added to modify  
> capabilities with respect to specific UID namespaces.
> 
> Here's a list of capabilities that should probably be in each "uid" key:
> CAP_CHOWN
> CAP_DAC_OVERRIDE
> CAP_DAC_READ_SEARCH
> CAP_FOWNER
> CAP_FSETID
> CAP_FS_MASK
> CAP_SETGID
> CAP_SETUID
> CAP_LINUX_IMMUTABLE
> CAP_IPC_OWNER

But this gets away from the idea of capabilities being an inherited set
of privileges independent of uid.  Really it should be file
capabilities, not (set)uid, which have the ability subsequently raise
caps.

Of course few people love capabilities, so maybe that would actually be
preferred...

> Cheers,
> Kyle Moffett

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 17:05                               ` Serge E. Hallyn
  2006-07-14 17:50                                 ` Kyle Moffett
@ 2006-07-14 17:56                                 ` Eric W. Biederman
  1 sibling, 0 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 17:56 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Dave Hansen, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>> 
>> Capabilities have always fitted badly in with the normal unix
>> permissions.
>
> Well they're not supposed to fit in.

A better way to phrase that is to say the don't fit into the unix
culture well.

> If we keep permchecks as uid==0 on files which invoke kernel callbacks,
> then we can only say once what root is allowed to do.  If we make them
> capability checks, then for differnet uses of namespaces we can have
> them do different things.  For instance if we're making a separate user
> namespace for a checkpoint/restart purpose, we might want root to retain
> more privs than if we're making a vserver.

Yes. :)

> Look I just have to keep responding because you keep provoking :), but
> I'm looking at other code and don't even know which entries we're
> talking about.  If when I get to looking at them I find they really
> should be done by capabilities, I'll submit a patch and we can argue
> then.

Roughly the set can be found with:
find /proc/ -type f -uid 0 -perm -0200 ! '(' -path '/proc/[0-9]*' ')' ! '(' -perm 0022 ')'
find /sys/ -type f -uid 0 -perm -0200 !  ! '(' -perm 0022 ')'

Very few writable files in /proc or sysfs do any sort of capability checking.

In essence every little file by every little driver out there has
this problem.  

The unix permission way to handle this would be to chown these files
after bootup to the appropriate users (because the permissions cannot
be stored in the filesystem).

I have a hard time believing that chasing ever little driver is going to 
be a tractable solution to this problem.

>> So if we have a solution that works nicely with normal
>> unix permissions we will have a nice general solution, that is
>> easy for people to understand.
>> 
>> What I am talking about is making a small tweak to the permission
>> checking as below.  Why do you keep avoiding even considering it?
>
> Not only don't I avoid considering it, I thought I'd even suggested it
> a while ago  :)

Sorry. Too many email messages that just seemed to miss the point,
when we had that miscommunication...

> It sounds good to me.

Cool.

Eric


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 15:13                         ` Eric W. Biederman
  2006-07-14 16:29                           ` Serge E. Hallyn
@ 2006-07-14 16:35                           ` Dave Hansen
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Hansen @ 2006-07-14 16:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

On Fri, 2006-07-14 at 09:13 -0600, Eric W. Biederman wrote:
> Already mentioned but.  rw permissions on sensitive files are for 
> uid == 0.  No capability checks are performed. 

I'd also like to point out that we can do this whole process very
incrementally.  If we want, we can start out with all containers
chroot'd into a filesystem namespace which uses read-only bind mounts
and doesn't allow any write access to the underlying filesystem.

As we introduce the capability checks, we can start to actually
cooperatively share more data.

-- Dave

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 19:02                 ` Eric W. Biederman
  2006-07-13 20:03                   ` Dave Hansen
@ 2006-07-13 21:41                   ` Serge E. Hallyn
  2006-07-14  3:52                     ` Eric W. Biederman
  1 sibling, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2006-07-13 21:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Hansen, Serge E. Hallyn, Cedric Le Goater, linux-kernel,
	Andrew Morton, Kirill Korotaev, Andrey Savochkin, Herbert Poetzl,
	Sam Vilain

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Dave Hansen <haveblue@us.ibm.com> writes:
> 
> > On Thu, 2006-07-13 at 12:14 -0600, Eric W. Biederman wrote:
> >> Maybe.  I really think the sane semantics are in a different uid namespace.
> >> So you can't assumes uids are the same.  Otherwise you can't handle open
> >> file descriptors or files passed through unix domain sockets.
> >
> > Eric, could you explain this a little bit more?  I'm not sure I
> > understand the details of why this is a problem?
> 
> Very simply.
> 
> In the presence of a user namespace.  
> All comparisons of a user equality need to be of the tuple (user namespace, user id).
> Any comparison that does not do that is an optimization.
> 
> Because you can have access to files created in another user namespace it
> is very unlikely that optimization will apply very frequently.  The easy scenario
> to get access to a file descriptor from another context is to consider unix
> domain sockets.

What does that have to do with uids?  If you receive an fd, uids don't
matter in any case.  The only permission checks which happen are LSM
hooks, which should be uid-agnostic.

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 21:41                   ` Serge E. Hallyn
@ 2006-07-14  3:52                     ` Eric W. Biederman
  2006-07-14 14:02                       ` Serge E. Hallyn
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14  3:52 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Dave Hansen, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Dave Hansen <haveblue@us.ibm.com> writes:
>> 
>> > On Thu, 2006-07-13 at 12:14 -0600, Eric W. Biederman wrote:
>> >> Maybe.  I really think the sane semantics are in a different uid namespace.
>> >> So you can't assumes uids are the same.  Otherwise you can't handle open
>> >> file descriptors or files passed through unix domain sockets.
>> >
>> > Eric, could you explain this a little bit more?  I'm not sure I
>> > understand the details of why this is a problem?
>> 
>> Very simply.
>> 
>> In the presence of a user namespace.  
>> All comparisons of a user equality need to be of the tuple (user namespace,
> user id).
>> Any comparison that does not do that is an optimization.
>> 
>> Because you can have access to files created in another user namespace it
>> is very unlikely that optimization will apply very frequently.  The easy
> scenario
>> to get access to a file descriptor from another context is to consider unix
>> domain sockets.
>
> What does that have to do with uids?  If you receive an fd, uids don't
> matter in any case.  The only permission checks which happen are LSM
> hooks, which should be uid-agnostic.

You are guest uid 0.  You get a directory file descriptor from another namespace.
You call fchdir.

If you permission checks are not (user namespace, uid) what can't you do?

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14  3:52                     ` Eric W. Biederman
@ 2006-07-14 14:02                       ` Serge E. Hallyn
  2006-07-14 14:50                         ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2006-07-14 14:02 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Dave Hansen, Cedric Le Goater, linux-kernel,
	Andrew Morton, Kirill Korotaev, Andrey Savochkin, Herbert Poetzl,
	Sam Vilain

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> Dave Hansen <haveblue@us.ibm.com> writes:
> >> 
> >> > On Thu, 2006-07-13 at 12:14 -0600, Eric W. Biederman wrote:
> >> >> Maybe.  I really think the sane semantics are in a different uid namespace.
> >> >> So you can't assumes uids are the same.  Otherwise you can't handle open
> >> >> file descriptors or files passed through unix domain sockets.
> >> >
> >> > Eric, could you explain this a little bit more?  I'm not sure I
> >> > understand the details of why this is a problem?
> >> 
> >> Very simply.
> >> 
> >> In the presence of a user namespace.  
> >> All comparisons of a user equality need to be of the tuple (user namespace,
> > user id).
> >> Any comparison that does not do that is an optimization.
> >> 
> >> Because you can have access to files created in another user namespace it
> >> is very unlikely that optimization will apply very frequently.  The easy
> > scenario
> >> to get access to a file descriptor from another context is to consider unix
> >> domain sockets.
> >
> > What does that have to do with uids?  If you receive an fd, uids don't
> > matter in any case.  The only permission checks which happen are LSM
> > hooks, which should be uid-agnostic.
> 
> You are guest uid 0.  You get a directory file descriptor from another namespace.
> You call fchdir.
> 
> If you permission checks are not (user namespace, uid) what can't you do?

File descripters can only be passed over a unix socket, right?

So this seems to fall into the same "userspace should set things up
sanely" argument you've brought up before.

Don't get me wrong though - the idea of using in-kernel keys as
cross-namespace uid's is definately interesting.

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 14:02                       ` Serge E. Hallyn
@ 2006-07-14 14:50                         ` Eric W. Biederman
  2006-07-14 16:39                           ` Serge E. Hallyn
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 14:50 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Dave Hansen, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> If you permission checks are not (user namespace, uid) what can't you do?
>
> File descripters can only be passed over a unix socket, right?

No.  You can use /proc to do the same thing.  You can inherit
file descriptors, etc.  This isn't a door you can just close and ignore.
It is too easy to do this to assume you have closed every possible
way to get a descriptor from outside of ``container''.

Suppose you have user fred uid 1000 outside of any containers,
and you have user joe uid 1000 inside user uid namespace.  If you don't
make your permission checks (user namespace, uid) fred can do terrible
things to joe.

If I we don't do the expanded permission checks the only alternative
is to check to see if every object is in our ``container'' at every
access.  That isn't something I want to do.

I don't intend to partition objects just partition object look ups by name.
Which means that if you very carefully setup your initial process you
will never be able to find an object outside of your namespace.  But
the kernel never should assume user space has done that.

> So this seems to fall into the same "userspace should set things up
> sanely" argument you've brought up before.

For using it yes.  But not for correct kernel operation.

> Don't get me wrong though - the idea of using in-kernel keys as
> cross-namespace uid's is definately interesting.

That is their role in the kernel.  A flavor of key to handle uid
mapping needs to be added, but that is all.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 14:50                         ` Eric W. Biederman
@ 2006-07-14 16:39                           ` Serge E. Hallyn
  2006-07-14 17:18                             ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2006-07-14 16:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Dave Hansen, Cedric Le Goater, linux-kernel,
	Andrew Morton, Kirill Korotaev, Andrey Savochkin, Herbert Poetzl,
	Sam Vilain

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
> 
> > Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> If you permission checks are not (user namespace, uid) what can't you do?
> >
> > File descripters can only be passed over a unix socket, right?
> 
> No.  You can use /proc to do the same thing.  You can inherit

How?

> file descriptors, etc.  This isn't a door you can just close and ignore.
> It is too easy to do this to assume you have closed every possible
> way to get a descriptor from outside of ``container''.
> 
> Suppose you have user fred uid 1000 outside of any containers,
> and you have user joe uid 1000 inside user uid namespace.  If you don't
> make your permission checks (user namespace, uid) fred can do terrible

Only if user fred can reach user joe's files.

> If I we don't do the expanded permission checks the only alternative
> is to check to see if every object is in our ``container'' at every
> access.  That isn't something I want to do.

Ditto.

> I don't intend to partition objects just partition object look ups by name.
> Which means that if you very carefully setup your initial process you
> will never be able to find an object outside of your namespace.  But
> the kernel never should assume user space has done that.

Are you contradicting yourself here?

> > So this seems to fall into the same "userspace should set things up
> > sanely" argument you've brought up before.
> 
> For using it yes.  But not for correct kernel operation.

Well if we say the kernel controls files (barring LSMs) based on simple
uids, then that is correct kernel operation :)

In other words we're not letting the kernel do anything it isn't
supposed to do.

> > Don't get me wrong though - the idea of using in-kernel keys as
> > cross-namespace uid's is definately interesting.
> 
> That is their role in the kernel.

They have more roles than that...

I phrased it the way I did to point out we shouldn't need to be using
ecryptfs - which uses the keyring to store actual encryptions keys - to
do this.

> A flavor of key to handle uid
> mapping needs to be added, but that is all.

How can that be all?  How do we represent this on the filesystem then?

We can't store (namespace, uid) because we presumably dont' have actual
external unique ids for a namespace.  So we really will need to provide
global uids, and store those in the keys (and with the files).  So now
3475276bc0dcd9cc501ba9bb7f12683f1b867066 is my uid no matter where I am,
for instance.

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 16:39                           ` Serge E. Hallyn
@ 2006-07-14 17:18                             ` Eric W. Biederman
  2006-07-14 17:24                               ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 17:18 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Dave Hansen, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>> 
>> > Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> If you permission checks are not (user namespace, uid) what can't you do?
>> >
>> > File descripters can only be passed over a unix socket, right?
>> 
>> No.  You can use /proc to do the same thing.  You can inherit
>
> How?

/proc/<pid>/fd/...
/proc/<pid>/exe
/proc/<pid>/cwd

It isn't quite the same as you are actually opening a second
copy of the file descriptor but the essence is the same.

>> file descriptors, etc.  This isn't a door you can just close and ignore.
>> It is too easy to do this to assume you have closed every possible
>> way to get a descriptor from outside of ``container''.
>> 
>> Suppose you have user fred uid 1000 outside of any containers,
>> and you have user joe uid 1000 inside user uid namespace.  If you don't
>> make your permission checks (user namespace, uid) fred can do terrible
>
> Only if user fred can reach user joe's files.

If they are logged in a the same time this is possible.

>> If I we don't do the expanded permission checks the only alternative
>> is to check to see if every object is in our ``container'' at every
>> access.  That isn't something I want to do.
>
> Ditto.
>
>> I don't intend to partition objects just partition object look ups by name.
                                                            ^just
>> Which means that if you very carefully setup your initial process you
>> will never be able to find an object outside of your namespace.  But
>> the kernel never should assume user space has done that.
>
> Are you contradicting yourself here?

I don't think so.

Basically what I am saying all new looks happen in the new namespace.
That makes it hard to find objects that are not in your namespace.
If you already have those objects open, or if can find a way besides
lookup by name to access those objects you can see objects in another
namespace.

>> > So this seems to fall into the same "userspace should set things up
>> > sanely" argument you've brought up before.
>> 
>> For using it yes.  But not for correct kernel operation.
>
> Well if we say the kernel controls files (barring LSMs) based on simple
> uids, then that is correct kernel operation :)

If we so define it yes, I believe that definition is wrong.

> In other words we're not letting the kernel do anything it isn't
> supposed to do.

In the context of user namespaces if the compare is not 
(user_ns, uid) == (other_user_ns, other_uid) then we are.

>> > Don't get me wrong though - the idea of using in-kernel keys as
>> > cross-namespace uid's is definately interesting.
>> 
>> That is their role in the kernel.
>
> They have more roles than that...
>
> I phrased it the way I did to point out we shouldn't need to be using
> ecryptfs - which uses the keyring to store actual encryptions keys - to
> do this.

Ok. I will buy that.

>> A flavor of key to handle uid
>> mapping needs to be added, but that is all.
>
> How can that be all?  How do we represent this on the filesystem then?

You handle it at the VFS layer unless the filesystem overrides the
permission checks.

> We can't store (namespace, uid) because we presumably dont' have actual
> external unique ids for a namespace.  So we really will need to provide
> global uids, and store those in the keys (and with the files).  So now
> 3475276bc0dcd9cc501ba9bb7f12683f1b867066 is my uid no matter where I am,
> for instance.

That is one way to do it.  Anyway as long as we can see it is possible using
keys now we can sort out the rest of the details later.

Eric


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 17:18                             ` Eric W. Biederman
@ 2006-07-14 17:24                               ` Dave Hansen
  2006-07-14 18:06                                 ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2006-07-14 17:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

On Fri, 2006-07-14 at 11:18 -0600, Eric W. Biederman wrote:
> /proc/<pid>/fd/...
> /proc/<pid>/exe
> /proc/<pid>/cwd
> 
> It isn't quite the same as you are actually opening a second
> copy of the file descriptor but the essence is the same. 

Last I checked, those were symlinks and didn't work for things like
deleted files.  Am I wrong?

-- Dave


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 17:24                               ` Dave Hansen
@ 2006-07-14 18:06                                 ` Eric W. Biederman
  2006-07-14 18:42                                   ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 18:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

Dave Hansen <haveblue@us.ibm.com> writes:

> On Fri, 2006-07-14 at 11:18 -0600, Eric W. Biederman wrote:
>> /proc/<pid>/fd/...
>> /proc/<pid>/exe
>> /proc/<pid>/cwd
>> 
>> It isn't quite the same as you are actually opening a second
>> copy of the file descriptor but the essence is the same. 
>
> Last I checked, those were symlinks and didn't work for things like
> deleted files.  Am I wrong?

Yes.  They are not really symlinks.

Wanting to have an executable that was deleted after it was done
executing.  I wrote it to a file. opened it, unlinked it, set close
on exec, and the exec'd it with /proc/self/fd/N.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 18:06                                 ` Eric W. Biederman
@ 2006-07-14 18:42                                   ` Dave Hansen
  2006-07-14 19:07                                     ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2006-07-14 18:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

On Fri, 2006-07-14 at 12:06 -0600, Eric W. Biederman wrote:
> > On Fri, 2006-07-14 at 11:18 -0600, Eric W. Biederman wrote:
> >> /proc/<pid>/fd/...
> >> /proc/<pid>/exe
> >> /proc/<pid>/cwd
> >> 
> >> It isn't quite the same as you are actually opening a second
> >> copy of the file descriptor but the essence is the same. 
> >
> > Last I checked, those were symlinks and didn't work for things like
> > deleted files.  Am I wrong?
> 
> Yes.  They are not really symlinks.
> 
> Wanting to have an executable that was deleted after it was done
> executing.  I wrote it to a file. opened it, unlinked it, set close
> on exec, and the exec'd it with /proc/self/fd/N.

Well, on one hand, it makes checkpoints with deleted files easier ;)

Now that I'm actually looking at the code, isn't
proc_fd_access_allowed()'s permission just derived from ptrace
permissions?  It doesn't seem to involve uids directly at all!

-- Dave


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 18:42                                   ` Dave Hansen
@ 2006-07-14 19:07                                     ` Eric W. Biederman
  0 siblings, 0 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 19:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain

Dave Hansen <haveblue@us.ibm.com> writes:

> On Fri, 2006-07-14 at 12:06 -0600, Eric W. Biederman wrote:
>> > On Fri, 2006-07-14 at 11:18 -0600, Eric W. Biederman wrote:
>> >> /proc/<pid>/fd/...
>> >> /proc/<pid>/exe
>> >> /proc/<pid>/cwd
>> >> 
>> >> It isn't quite the same as you are actually opening a second
>> >> copy of the file descriptor but the essence is the same. 
>> >
>> > Last I checked, those were symlinks and didn't work for things like
>> > deleted files.  Am I wrong?
>> 
>> Yes.  They are not really symlinks.
>> 
>> Wanting to have an executable that was deleted after it was done
>> executing.  I wrote it to a file. opened it, unlinked it, set close
>> on exec, and the exec'd it with /proc/self/fd/N.
>
> Well, on one hand, it makes checkpoints with deleted files easier ;)
>
> Now that I'm actually looking at the code, isn't
> proc_fd_access_allowed()'s permission just derived from ptrace
> permissions?  It doesn't seem to involve uids directly at all!

You still have to actually open the file and from there you get to permission().

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 17:36         ` Cedric Le Goater
  2006-07-13 17:47           ` Serge E. Hallyn
@ 2006-07-13 17:59           ` Eric W. Biederman
  2006-07-13 21:22             ` Serge E. Hallyn
  1 sibling, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-13 17:59 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain, Serge E. Hallyn, Dave Hansen

Cedric Le Goater <clg@fr.ibm.com> writes:

> Eric W. Biederman wrote:
>
>>> I think user namespace should be unshared with filesystem. if not, the
>>> user/admin should know what is doing.
>> 
>> No.  The uids in a filesystem are interpreted in some user namespace
>> context.  We can discover that context at the first mount of the
>> filesystem. 
>
> ok. so once you're in such a user namespace, you can't unshare from it
> without loosing access to all your files ?

Any file that is accessible without needing specific user and group
permissions will still be accessible.  In essence you are an enhanced
version of the nobody user when you access files outside of your
namespace.  What you can't do is create files as you cannot be
represented on the filesystem.

>> Assuming the uids on a filesystem are the same set of uids your process
>> is using is just wrong.
>
> well, this is what is currently done without user namespaces.

Yes.  But the whole system is assumed to be in the same user namespace,
so that is a reasonable assumption.  Once we break that assumption we
need to do something different.  There are a few network filesystems
that are at least working on using keys to authenticate users.

>>>> I believe some of the key infrastructure which is roughly kerberos
>>>> authentication tokens could be used for this purpose.
>>> please elaborate ? i'm not sure to understand why you want to use the keys
>>> to map users.
>> 
>> keys are essentially security credentials for something besides the
>> local kernel.  Think kerberos tickets.  That makes the keys the
>> obvious place to say what uid you are in a different user namespace
>> and similar things.
>
> what about performance ? wouldn't that slow the checking ?

It needs to be looked at, but it shouldn't slow the same namespace
case, and permission checking is largely a slow path issue.  So a little
overhead at open time is preferred to overhead after you get the file
open.

Mapping users is clearly a separate problem, but related problem.  For
sharing read-only data that you don't mind sharing with everyone the
current set of semantics I have described is fine.  

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 17:59           ` Eric W. Biederman
@ 2006-07-13 21:22             ` Serge E. Hallyn
  2006-07-14  3:50               ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2006-07-13 21:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Quoting Eric W. Biederman (ebiederm@xmission.com):
> >> keys are essentially security credentials for something besides the
> >> local kernel.  Think kerberos tickets.  That makes the keys the
> >> obvious place to say what uid you are in a different user namespace
> >> and similar things.
> >
> > what about performance ? wouldn't that slow the checking ?
> 
> It needs to be looked at, but it shouldn't slow the same namespace
> case,

How so?  The processesing is the same.

> and permission checking is largely a slow path issue.  So a little
> overhead at open time is preferred to overhead after you get the file
> open.

Unsure which approach has overhead after file open...

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-13 21:22             ` Serge E. Hallyn
@ 2006-07-14  3:50               ` Eric W. Biederman
  0 siblings, 0 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14  3:50 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Dave Hansen

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> >> keys are essentially security credentials for something besides the
>> >> local kernel.  Think kerberos tickets.  That makes the keys the
>> >> obvious place to say what uid you are in a different user namespace
>> >> and similar things.
>> >
>> > what about performance ? wouldn't that slow the checking ?
>> 
>> It needs to be looked at, but it shouldn't slow the same namespace
>> case,
>
> How so?  The processesing is the same.

If you do mapping of uids that would not trigger if you have
matching user namespaces.

>> and permission checking is largely a slow path issue.  So a little
>> overhead at open time is preferred to overhead after you get the file
>> open.
>
> Unsure which approach has overhead after file open...

I don't know that I have seen that one yet.  Sorry this was generally
a statement of principle.

Eric


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-12 17:24       ` Eric W. Biederman
  2006-07-13 17:36         ` Cedric Le Goater
@ 2006-07-14 14:17         ` Serge E. Hallyn
  2006-07-14 15:05           ` Eric W. Biederman
  2006-07-14 15:43           ` Kyle Moffett
  1 sibling, 2 replies; 107+ messages in thread
From: Serge E. Hallyn @ 2006-07-14 14:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Quoting Eric W. Biederman (ebiederm@xmission.com):
> Cedric Le Goater <clg@fr.ibm.com> writes:
> 
> > Eric W. Biederman wrote:
> >> Cedric Le Goater <clg@fr.ibm.com> writes:
> >> 
> >>> This patch adds the user namespace.
> >>>
> >>> Basically, it allows a process to unshare its user_struct table,
> >>> resetting at the same time its own user_struct and all the associated
> >>> accounting.
> >>>
> >>> For the moment, the root_user is added to the new user namespace when
> >>> it is cloned. An alternative behavior would be to let the system
> >>> allocate a new user_struct(0) in each new user namespace. However,
> >>> these 0 users would not have the privileges of the root_user and it
> >>> would be necessary to work on the process capabilities to give them
> >>> some permissions.
> >> 
> >> It is completely the wrong thing for a the root_user to span multiple
> >> namespaces as you describe.  It is important for uid 0 in other namespaces
> >> to not have the privileges of the root_user.  That is half the point.
> >
> > ok good. that's what i thought also.
> >
> >> Too many files in sysfs and proc don't require caps but instead simply
> >> limit things to uid 0.  Having a separate uid 0 in the different namespaces
> >> instantly makes all of these files inaccessible, and keeps processes from
> >> doing something bad.
> >
> > but in order to be useful, the uid 0 in other namespaces will need to have
> > some capabilities.
> 
> Yes.  It is useful for uid 0 in other namespaces to have some capabilities.
> But that is just a matter of giving another user some additional
> capabilities.  That mechanism may still work but it should not be
> namespace specific magic there.  The trick I guess is which
> capabilities a setuid binary for the other root user gets.

Agreed.  Any ideas for how to specify this?  In a sane way?  I suppose
it could be a part of forking off the user namespace.  No idea what
interface we'd want to use.  Perhaps root user in the child namespace by
default gets all the caps as the root user, and is expected to drop what
it doesn't need?

> One thing to be careful about here is that, at least as I 
> envision it until setresuid is called your user does not change
> even when you unshare your user namespace.
> 
> >> To a filesystem a uid does not share a uid namespace with the only things
> >> that should be accessible are those things that are readable/writeable
> >> by everyone.   Unless the filesystem has provisions for storing multiple
> >> uid namespaces not files should be able to be created.  Think NFS root
> >> squash.
> >
> > I think user namespace should be unshared with filesystem. if not, the
> > user/admin should know what is doing.
> 
> No.  The uids in a filesystem are interpreted in some user namespace
> context.  We can discover that context at the first mount of the
> filesystem.  Assuming the uids on a filesystem are the same set
> of uids your process is using is just wrong.

But, when I insert a usb keychain disk into my laptop, that fs assumes
the uids on it's fs are the same as uids on my laptop...

Solving that problem is interesting, but may be something to work with a
much wider community on.  I.e. the cifs and nifs folks.  I haven't even
googled to see what they say about it.

> > now, we could work on extending it to support fine grain user namespace
> > which i think can done on top of this first patch.
> 
> Yes.  Your patch does lay some interesting foundation work.
> But we must not merge it upstream until we have a complete patchset
> that handles all of the user namespace issues.

Don't think Cedric expected this to be merged  :)  Just to start
discussion, which it certainly did...

> >> I think the key infrastructure needs to be looked at here as well.
> >>
> >> There needs to be a user namespace association for mounted filesystems.
> >
> > yes you could expect that to check the i_uid against fsuid but should we
> > enforce it completely ?
> >
> > we already have an issue today with a simple NFS mount on 2 hosts with
> > different user mapping. namespace can't fix all issues.
> 
> Yes.  This is an existing problem, which we have just escalated the
> frequency of immensely if we are doing user namespaces.  The normal
> solution is to put everyone on the network in a single user id
> administration domain with ldap or NIS.
> 
> However that is avoiding the problem, and having multiple user id
> domains is the point of a user id namespace.  If we escalate the
> problem we should solve it.

Of course if we solve it in that way, then we may not need user
namespaces any more  :)  Only the root user ends up still shared, and
even that could be solved if we're doing something that drastic.

> keys are essentially security credentials for something besides the
> local kernel.  Think kerberos tickets.  That makes the keys the
> obvious place to say what uid you are in a different user namespace
> and similar things.

If we're going to talk about keys (which I'd like to) I think we need to
decide whether we are just using them as an easy wider-than-uid
identifier, or if we actually need cryptographic keys to prevent
"identity theft"  (heheh).  I don't know that we need the latter for
anything, but of course if we're going to try for a more general
solution, then we do.

-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 14:17         ` Serge E. Hallyn
@ 2006-07-14 15:05           ` Eric W. Biederman
  2006-07-14 16:46             ` Serge E. Hallyn
  2006-07-14 15:43           ` Kyle Moffett
  1 sibling, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 15:05 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Dave Hansen

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> Cedric Le Goater <clg@fr.ibm.com> writes:
>> 
>> > Eric W. Biederman wrote:
>> >> Cedric Le Goater <clg@fr.ibm.com> writes:
>> >> 
>> >>> This patch adds the user namespace.
>> >>>
>> >>> Basically, it allows a process to unshare its user_struct table,
>> >>> resetting at the same time its own user_struct and all the associated
>> >>> accounting.
>> >>>
>> >>> For the moment, the root_user is added to the new user namespace when
>> >>> it is cloned. An alternative behavior would be to let the system
>> >>> allocate a new user_struct(0) in each new user namespace. However,
>> >>> these 0 users would not have the privileges of the root_user and it
>> >>> would be necessary to work on the process capabilities to give them
>> >>> some permissions.
>> >> 
>> >> It is completely the wrong thing for a the root_user to span multiple
>> >> namespaces as you describe.  It is important for uid 0 in other namespaces
>> >> to not have the privileges of the root_user.  That is half the point.
>> >
>> > ok good. that's what i thought also.
>> >
>> >> Too many files in sysfs and proc don't require caps but instead simply
>> >> limit things to uid 0.  Having a separate uid 0 in the different namespaces
>> >> instantly makes all of these files inaccessible, and keeps processes from
>> >> doing something bad.
>> >
>> > but in order to be useful, the uid 0 in other namespaces will need to have
>> > some capabilities.
>> 
>> Yes.  It is useful for uid 0 in other namespaces to have some capabilities.
>> But that is just a matter of giving another user some additional
>> capabilities.  That mechanism may still work but it should not be
>> namespace specific magic there.  The trick I guess is which
>> capabilities a setuid binary for the other root user gets.
>
> Agreed.  Any ideas for how to specify this?  In a sane way?  I suppose
> it could be a part of forking off the user namespace.  No idea what
> interface we'd want to use.  Perhaps root user in the child namespace by
> default gets all the caps as the root user, and is expected to drop what
> it doesn't need?

Currently this is a security module policy, so it gets weird.
The default is:
> int cap_bprm_set_security (struct linux_binprm *bprm)
> {
> 	/* Copied from fs/exec.c:prepare_binprm. */
> 
> 	/* We don't have VFS support for capabilities yet */
> 	cap_clear (bprm->cap_inheritable);
> 	cap_clear (bprm->cap_permitted);
> 	cap_clear (bprm->cap_effective);
> 
> 	/*  To support inheritance of root-permissions and suid-root
> 	 *  executables under compatibility mode, we raise all three
> 	 *  capability sets for the file.
> 	 *
> 	 *  If only the real uid is 0, we only raise the inheritable
> 	 *  and permitted sets of the executable file.
> 	 */
> 
> 	if (!issecure (SECURE_NOROOT)) {
> 		if (bprm->e_uid == 0 || current->uid == 0) {
> 			cap_set_full (bprm->cap_inheritable);
> 			cap_set_full (bprm->cap_permitted);
> 		}
> 		if (bprm->e_uid == 0)
> 			cap_set_full (bprm->cap_effective);
> 	}
> 	return 0;
> }

My gut feel is that we put a cap_suid in struct user, and the
setup some kind of user interface to it.  And then just
have the above routine copy cap_suid from the struct user
and not special case uid == 0.

But only the root user would by default have any caps in his
struct user.

>> No.  The uids in a filesystem are interpreted in some user namespace
>> context.  We can discover that context at the first mount of the
>> filesystem.  Assuming the uids on a filesystem are the same set
>> of uids your process is using is just wrong.
>
> But, when I insert a usb keychain disk into my laptop, that fs assumes
> the uids on it's fs are the same as uids on my laptop...

I agree that setting the fs_user_namespace at mount time is fine.
However when we use a mount that a process in another user namespace
we need to not assume the uids are the same.

Do you see the difference?

> Solving that problem is interesting, but may be something to work with a
> much wider community on.  I.e. the cifs and nifs folks.  I haven't even
> googled to see what they say about it.

Yes.

>> Yes.  Your patch does lay some interesting foundation work.
>> But we must not merge it upstream until we have a complete patchset
>> that handles all of the user namespace issues.
>
> Don't think Cedric expected this to be merged  :)  Just to start
> discussion, which it certainly did...

If we could have [RFC] in front of these proof of concept patches
it would clear up a lot of confusion.

> If we're going to talk about keys (which I'd like to) I think we need to
> decide whether we are just using them as an easy wider-than-uid
> identifier, or if we actually need cryptographic keys to prevent
> "identity theft"  (heheh).  I don't know that we need the latter for
> anything, but of course if we're going to try for a more general
> solution, then we do.

Actually I was thinking something as mundane as a mapping table.  This
uid in this namespace equals that uid in that other namespace.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 15:05           ` Eric W. Biederman
@ 2006-07-14 16:46             ` Serge E. Hallyn
  2006-07-14 16:58               ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Serge E. Hallyn @ 2006-07-14 16:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain,
	Dave Hansen

Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serue@us.ibm.com> writes:
> >> No.  The uids in a filesystem are interpreted in some user namespace
> >> context.  We can discover that context at the first mount of the
> >> filesystem.  Assuming the uids on a filesystem are the same set
> >> of uids your process is using is just wrong.
> >
> > But, when I insert a usb keychain disk into my laptop, that fs assumes
> > the uids on it's fs are the same as uids on my laptop...
> 
> I agree that setting the fs_user_namespace at mount time is fine.
> However when we use a mount that a process in another user namespace
> we need to not assume the uids are the same.
> 
> Do you see the difference?

Aaah - so you don't want to store this on the fs.  So this is actually
like what I had mentioned many many emails ago?

> > much wider community on.  I.e. the cifs and nifs folks.  I haven't even
> > googled to see what they say about it.
> 
> Yes.
> 
> >> Yes.  Your patch does lay some interesting foundation work.
> >> But we must not merge it upstream until we have a complete patchset
> >> that handles all of the user namespace issues.
> >
> > Don't think Cedric expected this to be merged  :)  Just to start
> > discussion, which it certainly did...
> 
> If we could have [RFC] in front of these proof of concept patches
> it would clear up a lot of confusion.

Agreed.

> > If we're going to talk about keys (which I'd like to) I think we need to
> > decide whether we are just using them as an easy wider-than-uid
> > identifier, or if we actually need cryptographic keys to prevent
> > "identity theft"  (heheh).  I don't know that we need the latter for
> > anything, but of course if we're going to try for a more general
> > solution, then we do.
> 
> Actually I was thinking something as mundane as a mapping table.  This
> uid in this namespace equals that uid in that other namespace.

I see.

That's also what I was imagining earlier, but it seems crass somehow.
I'd almost prefer to just tag a mount with a user namespace implicitly,
and only allow the mounter to say 'do' or 'don't' allow this to be read
by users in another namespace.  Then in the 'don't' case, user joe
[1000] can't read files belonging to user jack [1000] in another
namespace.  It's stricter, but clean.

But whether we do mapping tables or simple isolation, I do still like
the idea of pursuing the use of the keystore for global uids.

thanks,
-serge

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 16:46             ` Serge E. Hallyn
@ 2006-07-14 16:58               ` Eric W. Biederman
  0 siblings, 0 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 16:58 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Dave Hansen

"Serge E. Hallyn" <serue@us.ibm.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> "Serge E. Hallyn" <serue@us.ibm.com> writes:
>> >> No.  The uids in a filesystem are interpreted in some user namespace
>> >> context.  We can discover that context at the first mount of the
>> >> filesystem.  Assuming the uids on a filesystem are the same set
>> >> of uids your process is using is just wrong.
>> >
>> > But, when I insert a usb keychain disk into my laptop, that fs assumes
>> > the uids on it's fs are the same as uids on my laptop...
>> 
>> I agree that setting the fs_user_namespace at mount time is fine.
>> However when we use a mount that a process in another user namespace
>> we need to not assume the uids are the same.
>> 
>> Do you see the difference?
>
> Aaah - so you don't want to store this on the fs.  So this is actually
> like what I had mentioned many many emails ago?

Quite possibly.  I'm not certain where you go the idea I was thinking
of storing this on the fs.  I think you must have been thinking of
the Linux-Vserver implementation.

>> Actually I was thinking something as mundane as a mapping table.  This
>> uid in this namespace equals that uid in that other namespace.
>
> I see.
>
> That's also what I was imagining earlier, but it seems crass somehow.
> I'd almost prefer to just tag a mount with a user namespace implicitly,
> and only allow the mounter to say 'do' or 'don't' allow this to be read
> by users in another namespace.  Then in the 'don't' case, user joe
> [1000] can't read files belonging to user jack [1000] in another
> namespace.  It's stricter, but clean.
>
> But whether we do mapping tables or simple isolation, I do still like
> the idea of pursuing the use of the keystore for global uids.

Yes.  I guess my thinking is that the mapping effort and keys are an
enhancement after we get the basic user namespace working, to overcome
the limitations.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 14:17         ` Serge E. Hallyn
  2006-07-14 15:05           ` Eric W. Biederman
@ 2006-07-14 15:43           ` Kyle Moffett
  2006-07-14 16:13             ` Eric W. Biederman
  1 sibling, 1 reply; 107+ messages in thread
From: Kyle Moffett @ 2006-07-14 15:43 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Eric W. Biederman, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain,
	Dave Hansen

On Jul 14, 2006, at 10:17:28, Serge E. Hallyn wrote:
> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> No.  The uids in a filesystem are interpreted in some user  
>> namespace context.  We can discover that context at the first  
>> mount of the filesystem.  Assuming the uids on a filesystem are  
>> the same set of uids your process is using is just wrong.
>
> But, when I insert a usb keychain disk into my laptop, that fs  
> assumes the uids on it's fs are the same as uids on my laptop...
>
> Solving that problem is interesting, but may be something to work  
> with a much wider community on.  I.e. the cifs and nifs folks.  I  
> haven't even googled to see what they say about it.

IMHO filesystems _and_ processes should be primary objects in a UID  
namespace.  This would make it possible to solve the usb-key problems  
_and_ the user-mounted FUSE problems.  If "ns0" is the boot uid  
namespace, then put the freshly mounted USB key in a new "ns1" (names  
just for convenience).  All the user processes would continue to be  
in ns0, but with the kernel keyring system you could create a new  
"uid" keytype and give the logged in user (ns0,user_uid) a user-key  
with (ns1,0*).  If you added bits to the user-keys to represent the  
equivalent of CAP_DAC_OVERRIDE/CAP_CHOWN/etc for that process and UID  
namespace, then the user could do anything to any file on their USB  
key, even change ownership, without disrupting the rest of the  
system.  Likewise, if you did that for user FUSE filesystems, then  
suid binaries would not be able to get themselves into trouble in  
exploitive FUSE infinitely-recursive monstrosities.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-14 15:43           ` Kyle Moffett
@ 2006-07-14 16:13             ` Eric W. Biederman
  0 siblings, 0 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-14 16:13 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Serge E. Hallyn, Cedric Le Goater, linux-kernel, Andrew Morton,
	Kirill Korotaev, Andrey Savochkin, Herbert Poetzl, Sam Vilain,
	Dave Hansen

Kyle Moffett <mrmacman_g4@mac.com> writes:

> On Jul 14, 2006, at 10:17:28, Serge E. Hallyn wrote:
>> Quoting Eric W. Biederman (ebiederm@xmission.com):
>>> No.  The uids in a filesystem are interpreted in some user  namespace
>>> context.  We can discover that context at the first mount of the filesystem.
>>> Assuming the uids on a filesystem are  the same set of uids your process is
>>> using is just wrong.
>>
>> But, when I insert a usb keychain disk into my laptop, that fs  assumes the
>> uids on it's fs are the same as uids on my laptop...
>>
>> Solving that problem is interesting, but may be something to work with a much
>> wider community on.  I.e. the cifs and nifs folks.  I haven't even googled to
>> see what they say about it.
>
> IMHO filesystems _and_ processes should be primary objects in a UID  namespace.
> This would make it possible to solve the usb-key problems  _and_ the
> user-mounted FUSE problems.  If "ns0" is the boot uid  namespace, then put the
> freshly mounted USB key in a new "ns1" (names  just for convenience).  All the
> user processes would continue to be  in ns0, but with the kernel keyring system
> you could create a new "uid" keytype and give the logged in user (ns0,user_uid)
> a user-key  with (ns1,0*).  If you added bits to the user-keys to represent the
> equivalent of CAP_DAC_OVERRIDE/CAP_CHOWN/etc for that process and UID
> namespace, then the user could do anything to any file on their USB  key, even
> change ownership, without disrupting the rest of the  system.  Likewise, if you
> did that for user FUSE filesystems, then suid binaries would not be able to get
> themselves into trouble in  exploitive FUSE infinitely-recursive monstrosities.

Thank you!

It is nice to see when someone else gets the point :)

I had not quite considered how that affects user mounted filesystems
but that does look like a real solution.

Now we just need to implement these things and work out the details of
user keys to map user ids.

Eric


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH -mm 6/7] add the user namespace to the execns syscall
  2006-07-11  7:50 [PATCH -mm 0/7] execns syscall and user namespace Cedric Le Goater
                   ` (4 preceding siblings ...)
  2006-07-11  7:50 ` [PATCH -mm 5/7] add user namespace Cedric Le Goater
@ 2006-07-11  7:50 ` Cedric Le Goater
  2006-07-11  7:50 ` [PATCH -mm 7/7] forbid the use of the unshare syscall on ipc namespaces Cedric Le Goater
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11  7:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Cedric Le Goater, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

[-- Attachment #1: user-namespace-add-execns-syscall.patch --]
[-- Type: text/plain, Size: 3088 bytes --]

This patch allows a process to unshare its user namespace through
the execns() syscall.

It also forbids the use of the unshare() syscall on user namespaces.
The purpose of this restriction is to prevent inconsistencies in the
accounting for each process.

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>

---
 fs/exec.c     |   22 ++++++++++++++++++----
 kernel/fork.c |    5 +++++
 2 files changed, 23 insertions(+), 4 deletions(-)

Index: 2.6.18-rc1-mm1/fs/exec.c
===================================================================
--- 2.6.18-rc1-mm1.orig/fs/exec.c
+++ 2.6.18-rc1-mm1/fs/exec.c
@@ -1254,6 +1254,7 @@ int do_execns(int unshare_flags, char * 
 	struct nsproxy *new_nsproxy = NULL, *old_nsproxy = NULL;
 	struct uts_namespace *uts, *new_uts = NULL;
 	struct ipc_namespace *ipc, *new_ipc = NULL;
+	struct user_namespace *user, *new_user = NULL;
 
 	err = unshare_utsname(unshare_flags, &new_uts);
 	if (err)
@@ -1261,24 +1262,27 @@ int do_execns(int unshare_flags, char * 
 	err = unshare_ipcs(unshare_flags, &new_ipc);
 	if (err)
 		goto bad_execns_cleanup_uts;
+	err = unshare_user_ns(unshare_flags, &new_user);
+	if (err)
+		goto bad_execns_cleanup_ipc;
 
-	if (new_uts || new_ipc) {
+	if (new_uts || new_ipc || new_user) {
 		old_nsproxy = current->nsproxy;
 		new_nsproxy = dup_namespaces(old_nsproxy);
 		if (!new_nsproxy) {
 			err = -ENOMEM;
-			goto bad_execns_cleanup_ipc;
+			goto bad_execns_cleanup_user;
 		}
 	}
 
 	err = do_execve(filename, argv, envp, regs);
 	if (err)
-		goto bad_execns_cleanup_ipc;
+		goto bad_execns_cleanup_user;
 
 	/* make sure all files are flushed */
 	flush_all_old_files(current->files);
 
-	if (new_uts || new_ipc) {
+	if (new_uts || new_ipc || new_user) {
 
 		task_lock(current);
 
@@ -1299,12 +1303,22 @@ int do_execns(int unshare_flags, char * 
 			new_ipc = ipc;
 		}
 
+		if (new_user) {
+			user = current->nsproxy->user_ns;
+			current->nsproxy->user_ns = new_user;
+			new_user = user;
+		}
+
 		task_unlock(current);
 	}
 
 	if (new_nsproxy)
 		put_nsproxy(new_nsproxy);
 
+bad_execns_cleanup_user:
+	if (new_user)
+		put_user_ns(new_user);
+
 bad_execns_cleanup_ipc:
 	if (new_ipc)
 		put_ipc_ns(new_ipc);
Index: 2.6.18-rc1-mm1/kernel/fork.c
===================================================================
--- 2.6.18-rc1-mm1.orig/kernel/fork.c
+++ 2.6.18-rc1-mm1/kernel/fork.c
@@ -1615,6 +1615,11 @@ asmlinkage long sys_unshare(unsigned lon
 				CLONE_NEWUTS|CLONE_NEWIPC))
 		goto bad_unshare_out;
 
+	/* Also return -EINVAL for all unsharable namespaces. May be a
+	 * -EACCES would be more appropriate ? */
+	if (unshare_flags & CLONE_NEWUSER)
+		goto bad_unshare_out;
+
 	if ((err = unshare_thread(unshare_flags)))
 		goto bad_unshare_out;
 	if ((err = unshare_fs(unshare_flags, &new_fs)))

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH -mm 7/7] forbid the use of the unshare syscall on ipc namespaces
  2006-07-11  7:50 [PATCH -mm 0/7] execns syscall and user namespace Cedric Le Goater
                   ` (5 preceding siblings ...)
  2006-07-11  7:50 ` [PATCH -mm 6/7] add the user namespace to the execns syscall Cedric Le Goater
@ 2006-07-11  7:50 ` Cedric Le Goater
  2006-07-11 14:10   ` Kirill Korotaev
  2006-07-11  8:02 ` [PATCH -mm 0/7] execns syscall and user namespace Arjan van de Ven
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11  7:50 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Cedric Le Goater, Pavel Emelianov, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

[-- Attachment #1: ipc-namespace-remove-unshare.patch --]
[-- Type: text/plain, Size: 2949 bytes --]

This patch forbids the use of the unshare() syscall on ipc namespaces.

The purpose of this restriction is to protect the system from
inconsistencies when the namespace is unshared. e.g. shared memory ids
will be removed but not the memory mappings, semaphore ids will be
removed but the semundos not cleared.


Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Andrey Savochkin <saw@sw.ru>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>

---
 kernel/fork.c |   23 +++++------------------
 1 file changed, 5 insertions(+), 18 deletions(-)

Index: 2.6.18-rc1-mm1/kernel/fork.c
===================================================================
--- 2.6.18-rc1-mm1.orig/kernel/fork.c
+++ 2.6.18-rc1-mm1/kernel/fork.c
@@ -1604,7 +1604,6 @@ asmlinkage long sys_unshare(unsigned lon
 	struct sem_undo_list *new_ulist = NULL;
 	struct nsproxy *new_nsproxy = NULL, *old_nsproxy = NULL;
 	struct uts_namespace *uts, *new_uts = NULL;
-	struct ipc_namespace *ipc, *new_ipc = NULL;
 
 	check_unshare_flags(&unshare_flags);
 
@@ -1612,12 +1611,12 @@ asmlinkage long sys_unshare(unsigned lon
 	err = -EINVAL;
 	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
 				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
-				CLONE_NEWUTS|CLONE_NEWIPC))
+				CLONE_NEWUTS))
 		goto bad_unshare_out;
 
 	/* Also return -EINVAL for all unsharable namespaces. May be a
 	 * -EACCES would be more appropriate ? */
-	if (unshare_flags & CLONE_NEWUSER)
+	if (unshare_flags & (CLONE_NEWUSER|CLONE_NEWIPC))
 		goto bad_unshare_out;
 
 	if ((err = unshare_thread(unshare_flags)))
@@ -1636,20 +1635,18 @@ asmlinkage long sys_unshare(unsigned lon
 		goto bad_unshare_cleanup_fd;
 	if ((err = unshare_utsname(unshare_flags, &new_uts)))
 		goto bad_unshare_cleanup_semundo;
-	if ((err = unshare_ipcs(unshare_flags, &new_ipc)))
-		goto bad_unshare_cleanup_uts;
 
-	if (new_ns || new_uts || new_ipc) {
+	if (new_ns || new_uts) {
 		old_nsproxy = current->nsproxy;
 		new_nsproxy = dup_namespaces(old_nsproxy);
 		if (!new_nsproxy) {
 			err = -ENOMEM;
-			goto bad_unshare_cleanup_ipc;
+			goto bad_unshare_cleanup_uts;
 		}
 	}
 
 	if (new_fs || new_ns || new_sigh || new_mm || new_fd || new_ulist ||
-				new_uts || new_ipc) {
+				new_uts) {
 
 		task_lock(current);
 
@@ -1697,22 +1694,12 @@ asmlinkage long sys_unshare(unsigned lon
 			new_uts = uts;
 		}
 
-		if (new_ipc) {
-			ipc = current->nsproxy->ipc_ns;
-			current->nsproxy->ipc_ns = new_ipc;
-			new_ipc = ipc;
-		}
-
 		task_unlock(current);
 	}
 
 	if (new_nsproxy)
 		put_nsproxy(new_nsproxy);
 
-bad_unshare_cleanup_ipc:
-	if (new_ipc)
-		put_ipc_ns(new_ipc);
-
 bad_unshare_cleanup_uts:
 	if (new_uts)
 		put_uts_ns(new_uts);

--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 7/7] forbid the use of the unshare syscall on ipc namespaces
  2006-07-11  7:50 ` [PATCH -mm 7/7] forbid the use of the unshare syscall on ipc namespaces Cedric Le Goater
@ 2006-07-11 14:10   ` Kirill Korotaev
  2006-07-11 15:06     ` Cedric Le Goater
  0 siblings, 1 reply; 107+ messages in thread
From: Kirill Korotaev @ 2006-07-11 14:10 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Pavel Emelianov, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

This patch looks as an overkill for me.

If you really care about things you describe, you can forbid unsharing in cases:

1.
        undo_list = tsk->sysvsem.undo_list;
        if (undo_list)
                REFUSE_UNSHARE;
2. vma exists with vma->vm_ops == &shm_vm_ops;
3. file opened with f_op == &shm_file_operations

I also dislike exec() operation for such sort of things since you can have no executable
at hands due to changed fs namespace.

Thanks,
Kirill


> This patch forbids the use of the unshare() syscall on ipc namespaces.
> 
> The purpose of this restriction is to protect the system from
> inconsistencies when the namespace is unshared. e.g. shared memory ids
> will be removed but not the memory mappings, semaphore ids will be
> removed but the semundos not cleared.
> 
> 
> Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
> Cc: Andrew Morton <akpm@osdl.org>
> Cc: Pavel Emelianov <xemul@openvz.org>
> Cc: Kirill Korotaev <dev@openvz.org>
> Cc: Andrey Savochkin <saw@sw.ru>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: Herbert Poetzl <herbert@13thfloor.at>
> Cc: Sam Vilain <sam.vilain@catalyst.net.nz>
> Cc: Serge E. Hallyn <serue@us.ibm.com>
> Cc: Dave Hansen <haveblue@us.ibm.com>
> 
> ---
>  kernel/fork.c |   23 +++++------------------
>  1 file changed, 5 insertions(+), 18 deletions(-)
> 
> Index: 2.6.18-rc1-mm1/kernel/fork.c
> ===================================================================
> --- 2.6.18-rc1-mm1.orig/kernel/fork.c
> +++ 2.6.18-rc1-mm1/kernel/fork.c
> @@ -1604,7 +1604,6 @@ asmlinkage long sys_unshare(unsigned lon
>  	struct sem_undo_list *new_ulist = NULL;
>  	struct nsproxy *new_nsproxy = NULL, *old_nsproxy = NULL;
>  	struct uts_namespace *uts, *new_uts = NULL;
> -	struct ipc_namespace *ipc, *new_ipc = NULL;
>  
>  	check_unshare_flags(&unshare_flags);
>  
> @@ -1612,12 +1611,12 @@ asmlinkage long sys_unshare(unsigned lon
>  	err = -EINVAL;
>  	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>  				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
> -				CLONE_NEWUTS|CLONE_NEWIPC))
> +				CLONE_NEWUTS))
>  		goto bad_unshare_out;
>  
>  	/* Also return -EINVAL for all unsharable namespaces. May be a
>  	 * -EACCES would be more appropriate ? */
> -	if (unshare_flags & CLONE_NEWUSER)
> +	if (unshare_flags & (CLONE_NEWUSER|CLONE_NEWIPC))
>  		goto bad_unshare_out;
>  
>  	if ((err = unshare_thread(unshare_flags)))
> @@ -1636,20 +1635,18 @@ asmlinkage long sys_unshare(unsigned lon
>  		goto bad_unshare_cleanup_fd;
>  	if ((err = unshare_utsname(unshare_flags, &new_uts)))
>  		goto bad_unshare_cleanup_semundo;
> -	if ((err = unshare_ipcs(unshare_flags, &new_ipc)))
> -		goto bad_unshare_cleanup_uts;
>  
> -	if (new_ns || new_uts || new_ipc) {
> +	if (new_ns || new_uts) {
>  		old_nsproxy = current->nsproxy;
>  		new_nsproxy = dup_namespaces(old_nsproxy);
>  		if (!new_nsproxy) {
>  			err = -ENOMEM;
> -			goto bad_unshare_cleanup_ipc;
> +			goto bad_unshare_cleanup_uts;
>  		}
>  	}
>  
>  	if (new_fs || new_ns || new_sigh || new_mm || new_fd || new_ulist ||
> -				new_uts || new_ipc) {
> +				new_uts) {
>  
>  		task_lock(current);
>  
> @@ -1697,22 +1694,12 @@ asmlinkage long sys_unshare(unsigned lon
>  			new_uts = uts;
>  		}
>  
> -		if (new_ipc) {
> -			ipc = current->nsproxy->ipc_ns;
> -			current->nsproxy->ipc_ns = new_ipc;
> -			new_ipc = ipc;
> -		}
> -
>  		task_unlock(current);
>  	}
>  
>  	if (new_nsproxy)
>  		put_nsproxy(new_nsproxy);
>  
> -bad_unshare_cleanup_ipc:
> -	if (new_ipc)
> -		put_ipc_ns(new_ipc);
> -
>  bad_unshare_cleanup_uts:
>  	if (new_uts)
>  		put_uts_ns(new_uts);
> 
> --
> 


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 7/7] forbid the use of the unshare syscall on ipc namespaces
  2006-07-11 14:10   ` Kirill Korotaev
@ 2006-07-11 15:06     ` Cedric Le Goater
  0 siblings, 0 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11 15:06 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: linux-kernel, Andrew Morton, Pavel Emelianov, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

Kirill Korotaev wrote:

> This patch looks as an overkill for me.

it's a standalone patch. It can be dropped. I think there is some value to
it as we already agree

> If you really care about things you describe, you can forbid unsharing
> in cases:
> 
> 1.
>        undo_list = tsk->sysvsem.undo_list;
>        if (undo_list)
>                REFUSE_UNSHARE;
> 2. vma exists with vma->vm_ops == &shm_vm_ops;
> 3. file opened with f_op == &shm_file_operations

and there are also the netlink sockets mq_notify.

OK, so we agree that ipc namespaces cannot be unshared without extra
checks. I like the firewall approach : it's not safe, don't allow it. Which
is what the patch is doing : we can't unshare ipc namespace safely so let's
just forbid it :

	if (unshare_flags & CLONE_NEWIPC)
		goto bad_unshare_out;

simple, nop ? :)

> I also dislike exec() operation for such sort of things since you can
> have no executable at hands due to changed fs namespace.

what do you mean ? fs namespace doesn't change bc you need it to load the
new process image/interpreter.

C.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11  7:50 [PATCH -mm 0/7] execns syscall and user namespace Cedric Le Goater
                   ` (6 preceding siblings ...)
  2006-07-11  7:50 ` [PATCH -mm 7/7] forbid the use of the unshare syscall on ipc namespaces Cedric Le Goater
@ 2006-07-11  8:02 ` Arjan van de Ven
  2006-07-11  8:42   ` Cedric Le Goater
  2006-07-11 18:12 ` H. Peter Anvin
  2006-07-11 20:22 ` Eric W. Biederman
  9 siblings, 1 reply; 107+ messages in thread
From: Arjan van de Ven @ 2006-07-11  8:02 UTC (permalink / raw)
  To: Cedric Le Goater; +Cc: linux-kernel, Dave Hansen

On Tue, 2006-07-11 at 09:50 +0200, Cedric Le Goater wrote:
> Hello !
> 
> The following patchset adds the user namespace and a new syscall execns.
> 
> The user namespace will allow a process to unshare its user_struct table,
> resetting at the same time its own user_struct and all the associated
> accounting.
> 
> The purpose of execns is to make sure that a process unsharing a namespace
> is free from any reference in the previous namespace. the execve() semantic
> seems to be the best candidate as it already flushes the previous process
> context.
> 
> Thanks for reviewing, sharing, flaming !

how does this interact with the unshare() syscall ?
can the unshare syscall be rigged up such that you have the same effect?



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11  8:02 ` [PATCH -mm 0/7] execns syscall and user namespace Arjan van de Ven
@ 2006-07-11  8:42   ` Cedric Le Goater
  0 siblings, 0 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11  8:42 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, Dave Hansen

Arjan van de Ven wrote:

> how does this interact with the unshare() syscall ?

it complements unshare(). The purpose of this syscall is to unshare a
namespace after the process has been flushed.

> can the unshare syscall be rigged up such that you have the same effect?

We need a clean context with no reference in other namespaces to make
unshare safe. It seemed easier to add an improved execve() with an extra
flag than to modify unshare() to make it flush the old exec.

Now, that does not make unshare() useless. It's perfectly acceptable for
uts namespace. But IMO, it's dangerous for ipc namespace and user namespace
which are more complex because they have references all over the place :
network with socket, mm for shmem, files for accounting.

thanks,

C.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11  7:50 [PATCH -mm 0/7] execns syscall and user namespace Cedric Le Goater
                   ` (7 preceding siblings ...)
  2006-07-11  8:02 ` [PATCH -mm 0/7] execns syscall and user namespace Arjan van de Ven
@ 2006-07-11 18:12 ` H. Peter Anvin
  2006-07-11 18:26   ` Cedric Le Goater
  2006-07-11 20:22 ` Eric W. Biederman
  9 siblings, 1 reply; 107+ messages in thread
From: H. Peter Anvin @ 2006-07-11 18:12 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Eric W. Biederman, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Cedric Le Goater wrote:
> Hello !
> 
> The following patchset adds the user namespace and a new syscall execns.
> 
> The user namespace will allow a process to unshare its user_struct table,
> resetting at the same time its own user_struct and all the associated
> accounting.
> 
> The purpose of execns is to make sure that a process unsharing a namespace
> is free from any reference in the previous namespace. the execve() semantic
> seems to be the best candidate as it already flushes the previous process
> context.
> 
> Thanks for reviewing, sharing, flaming !
> 

I would like give a strong objection to the naming.  The -ve() suffix in 
execve() isn't jettisonable; it indicates its position within a family 
of functions (only one of which is a true system call.)

execven() would be better name (the -n argument coming after then -e 
argument).  The library could then provide execlen(), execlpn() etc as 
appropriate.

	-hpa

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11 18:12 ` H. Peter Anvin
@ 2006-07-11 18:26   ` Cedric Le Goater
  2006-07-11 18:28     ` H. Peter Anvin
  0 siblings, 1 reply; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11 18:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Eric W. Biederman, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

H. Peter Anvin wrote:

> I would like give a strong objection to the naming.  The -ve() suffix in
> execve() isn't jettisonable; it indicates its position within a family
> of functions (only one of which is a true system call.)
> 
> execven() would be better name (the -n argument coming after then -e
> argument).  The library could then provide execlen(), execlpn() etc as
> appropriate.

I agree. execns() is a shortcut.

This service behaves like execve() if the flag argument is 0, so I guess we
should keep the execve- prefix. However, we could be a bit more explicit on
the nature of this service and call it execve_unshare().

thanks,

C.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11 18:26   ` Cedric Le Goater
@ 2006-07-11 18:28     ` H. Peter Anvin
  2006-07-11 19:50       ` Ulrich Drepper
  0 siblings, 1 reply; 107+ messages in thread
From: H. Peter Anvin @ 2006-07-11 18:28 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Eric W. Biederman, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Cedric Le Goater wrote:
> H. Peter Anvin wrote:
> 
>> I would like give a strong objection to the naming.  The -ve() suffix in
>> execve() isn't jettisonable; it indicates its position within a family
>> of functions (only one of which is a true system call.)
>>
>> execven() would be better name (the -n argument coming after then -e
>> argument).  The library could then provide execlen(), execlpn() etc as
>> appropriate.
> 
> I agree. execns() is a shortcut.
> 
> This service behaves like execve() if the flag argument is 0, so I guess we
> should keep the execve- prefix. However, we could be a bit more explicit on
> the nature of this service and call it execve_unshare().
> 

How about execveu()?  -n looked a bit weird to me, mostly because the 
"le" form would be execlen() which looks like something completely 
different...

	-hpa

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11 18:28     ` H. Peter Anvin
@ 2006-07-11 19:50       ` Ulrich Drepper
  2006-07-11 21:50         ` Cedric Le Goater
  0 siblings, 1 reply; 107+ messages in thread
From: Ulrich Drepper @ 2006-07-11 19:50 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

On 7/11/06, H. Peter Anvin <hpa@zytor.com> wrote:
> How about execveu()?  -n looked a bit weird to me, mostly because the
> "le" form would be execlen() which looks like something completely
> different...

I would prefer a more general parameter.  With this extension it is
expected to have six new interfaces.  I really don't want to repeat
this if somebody comes up with yet another nice extension.

So, how about generalizing the parameter.  Make is a 'flags'
parameter, assign a number of bits to the unshare functionality and
leave the rest available.  Use a 'f' suffix, perhaps.  Then in future
more bits can be defined and, if necessary, additional parameters can
be added depending on set flags.  The userspace prototypes can then if
absolutely necessary be extended with an ellipsis.  Not nice but not
as bad as adding more and more intefaces.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11 19:50       ` Ulrich Drepper
@ 2006-07-11 21:50         ` Cedric Le Goater
  2006-07-11 21:57           ` H. Peter Anvin
  2006-07-12  0:28           ` H. Peter Anvin
  0 siblings, 2 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11 21:50 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: H. Peter Anvin, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

Ulrich Drepper wrote:
> On 7/11/06, H. Peter Anvin <hpa@zytor.com> wrote:
>> How about execveu()?  -n looked a bit weird to me, mostly because the
>> "le" form would be execlen() which looks like something completely
>> different...
> 
> I would prefer a more general parameter.  With this extension it is
> expected to have six new interfaces.  I really don't want to repeat
> this if somebody comes up with yet another nice extension.
> 
> So, how about generalizing the parameter.  Make is a 'flags'
> parameter, assign a number of bits to the unshare functionality and
> leave the rest available.  Use a 'f' suffix, perhaps.  Then in future
> more bits can be defined and, if necessary, additional parameters can
> be added depending on set flags.  The userspace prototypes can then if
> absolutely necessary be extended with an ellipsis.  Not nice but not
> as bad as adding more and more intefaces.

How's that ?

int execvef(int flags, const char *filename, char *const argv [], char
*const envp[]);

initially, flags would be :

#define EXECVEF_NEWNS	0x00000100
#define EXECVEF_NEWIPC	0x00000200
#define EXECVEF_NEWUTS	0x00000400
#define EXECVEF_NEWUSER	0x00000800

execvef() would behave like execve() if flags == 0 and would return EINVAL
if flags is invalid. unshare of a namespace can fail and usually returns
ENOMEM.

C.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11 21:50         ` Cedric Le Goater
@ 2006-07-11 21:57           ` H. Peter Anvin
  2006-07-12  0:16             ` Ulrich Drepper
  2006-07-12  0:28           ` H. Peter Anvin
  1 sibling, 1 reply; 107+ messages in thread
From: H. Peter Anvin @ 2006-07-11 21:57 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: Ulrich Drepper, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

Cedric Le Goater wrote:
> 
> How's that ?
> 
> int execvef(int flags, const char *filename, char *const argv [], char
> *const envp[]);
> 
> initially, flags would be :
> 
> #define EXECVEF_NEWNS	0x00000100
> #define EXECVEF_NEWIPC	0x00000200
> #define EXECVEF_NEWUTS	0x00000400
> #define EXECVEF_NEWUSER	0x00000800
> 
> execvef() would behave like execve() if flags == 0 and would return EINVAL
> if flags is invalid. unshare of a namespace can fail and usually returns
> ENOMEM.
> 

If flags comes first, I would rather like to call it execfve(), or 
perhaps execxve() ("extended") or execove() ("options").  execfve() 
sounds like it executes a file descriptor (which would probably be 
called fexecve()).

Perhaps more seriously, if we're adding more functionality already, it 
should acquire -at functionality (execveat) and take a directory argument.

	-hpa


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11 21:57           ` H. Peter Anvin
@ 2006-07-12  0:16             ` Ulrich Drepper
  2006-07-12  0:25               ` H. Peter Anvin
  0 siblings, 1 reply; 107+ messages in thread
From: Ulrich Drepper @ 2006-07-12  0:16 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel, Andrew Morton

On 7/11/06, H. Peter Anvin <hpa@zytor.com> wrote:
> > #define EXECVEF_NEWNS 0x00000100
> > #define EXECVEF_NEWIPC        0x00000200
> > #define EXECVEF_NEWUTS        0x00000400
> > #define EXECVEF_NEWUSER       0x00000800

Yes on these.


> If flags comes first, I would rather like to call it execfve(), or
> perhaps execxve() ("extended") or execove() ("options").  execfve()
> sounds like it executes a file descriptor (which would probably be
> called fexecve()).

I think execfve is fine.


> Perhaps more seriously, if we're adding more functionality already, it
> should acquire -at functionality (execveat) and take a directory argument.

We have fexecve already.  Adding -at variants is probably not the best
idea, it's confusing.  Note, that fexecve only takes a file
descriptor, not a file descriptor plus file name.

The only reason I could see for changing this is thatfexecve depends
on /proc.  But there is so much other functionality which won't work
if /proc isn't mounted that I'd rank this low.  I'm fine with just
adding execfve.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-12  0:16             ` Ulrich Drepper
@ 2006-07-12  0:25               ` H. Peter Anvin
  0 siblings, 0 replies; 107+ messages in thread
From: H. Peter Anvin @ 2006-07-12  0:25 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: linux-kernel, Andrew Morton

Ulrich Drepper wrote:
> On 7/11/06, H. Peter Anvin <hpa@zytor.com> wrote:
>> > #define EXECVEF_NEWNS 0x00000100
>> > #define EXECVEF_NEWIPC        0x00000200
>> > #define EXECVEF_NEWUTS        0x00000400
>> > #define EXECVEF_NEWUSER       0x00000800
> 
> Yes on these.
> 
> 
>> If flags comes first, I would rather like to call it execfve(), or
>> perhaps execxve() ("extended") or execove() ("options").  execfve()
>> sounds like it executes a file descriptor (which would probably be
>> called fexecve()).
> 
> I think execfve is fine.
> 
> 
>> Perhaps more seriously, if we're adding more functionality already, it
>> should acquire -at functionality (execveat) and take a directory 
>> argument.
> 
> We have fexecve already.  Adding -at variants is probably not the best
> idea, it's confusing.  Note, that fexecve only takes a file
> descriptor, not a file descriptor plus file name.
> 
> The only reason I could see for changing this is thatfexecve depends
> on /proc.  But there is so much other functionality which won't work
> if /proc isn't mounted that I'd rank this low.  I'm fine with just
> adding execfve.

It seems to me to make a lot of sense to make it execveat(), then.  That 
way it would provide the equivalent functionality of both execve() and 
fexecve(), plus additional functionality.

	-hpa

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11 21:50         ` Cedric Le Goater
  2006-07-11 21:57           ` H. Peter Anvin
@ 2006-07-12  0:28           ` H. Peter Anvin
  1 sibling, 0 replies; 107+ messages in thread
From: H. Peter Anvin @ 2006-07-12  0:28 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: Ulrich Drepper, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Eric W. Biederman, Herbert Poetzl, Sam Vilain,
	Serge E. Hallyn, Dave Hansen

Cedric Le Goater wrote:
> 
> How's that ?
> 
> int execvef(int flags, const char *filename, char *const argv [], char
> *const envp[]);
> 
> initially, flags would be :
> 
> #define EXECVEF_NEWNS	0x00000100
> #define EXECVEF_NEWIPC	0x00000200
> #define EXECVEF_NEWUTS	0x00000400
> #define EXECVEF_NEWUSER	0x00000800
> 
> execvef() would behave like execve() if flags == 0 and would return EINVAL
> if flags is invalid. unshare of a namespace can fail and usually returns
> ENOMEM.
> 

To be more specific, I guess, what I'm proposing is:

int execxveat(int flags, int dirfd, const char *filename,
	      char * const *argv, char * const *envp);

... with the -x- for the flags field; that can be dropped since it's 
already established that -at() variants can take additional flags.

	-hpa


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11  7:50 [PATCH -mm 0/7] execns syscall and user namespace Cedric Le Goater
                   ` (8 preceding siblings ...)
  2006-07-11 18:12 ` H. Peter Anvin
@ 2006-07-11 20:22 ` Eric W. Biederman
  2006-07-11 21:28   ` Cedric Le Goater
  2006-07-12 11:11   ` Kirill Korotaev
  9 siblings, 2 replies; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-11 20:22 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain, Serge E. Hallyn, Dave Hansen

Cedric Le Goater <clg@fr.ibm.com> writes:

> Hello !
>
> The following patchset adds the user namespace and a new syscall execns.
>
> The user namespace will allow a process to unshare its user_struct table,
> resetting at the same time its own user_struct and all the associated
> accounting.
>
> The purpose of execns is to make sure that a process unsharing a namespace
> is free from any reference in the previous namespace. the execve() semantic
> seems to be the best candidate as it already flushes the previous process
> context.
>
> Thanks for reviewing, sharing, flaming !


I haven't had a chance to do a thorough review yet but why is
this needed?

What can be left shared by switching to a new namespace and then
execing an executable?

Is it not possible to ensure what you are trying to ensure with
a good user space executable?

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11 20:22 ` Eric W. Biederman
@ 2006-07-11 21:28   ` Cedric Le Goater
  2006-07-12  3:24     ` Eric W. Biederman
  2006-07-12 11:11   ` Kirill Korotaev
  1 sibling, 1 reply; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-11 21:28 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain, Serge E. Hallyn, Dave Hansen

Hello eric,

Eric W. Biederman wrote:

>> The following patchset adds the user namespace and a new syscall execns.
>>
>> The user namespace will allow a process to unshare its user_struct table,
>> resetting at the same time its own user_struct and all the associated
>> accounting.
>>
>> The purpose of execns is to make sure that a process unsharing a namespace
>> is free from any reference in the previous namespace. the execve() semantic
>> seems to be the best candidate as it already flushes the previous process
>> context.
>>
>> Thanks for reviewing, sharing, flaming !
> 
> 
> I haven't had a chance to do a thorough review yet but why is
> this needed?
> 
> What can be left shared by switching to a new namespace and then
> execing an executable?
> 
> Is it not possible to ensure what you are trying to ensure with
> a good user space executable?

unshare() is unsafe for some namespaces because namespaces can reference
each other. For the ipc namespace, example are shm ids vs. vma, sem ids vs.
semundos, msq vs. netlink sockets. for the user namespace, open files. So
it seems reasonable to provide a way to unshare namespaces from a clean
process context.

Now, if you try to do that from user space, you will call unshare() then
execve(), which leaves plenty of room and time for nasty things to happen
in between the 2 calls.

C.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11 21:28   ` Cedric Le Goater
@ 2006-07-12  3:24     ` Eric W. Biederman
  2006-07-12 13:05       ` Cedric Le Goater
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-12  3:24 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain, Serge E. Hallyn, Dave Hansen

Cedric Le Goater <clg@fr.ibm.com> writes:

> Hello eric,
>
> Eric W. Biederman wrote:
>
>>> The following patchset adds the user namespace and a new syscall execns.
>>>
>>> The user namespace will allow a process to unshare its user_struct table,
>>> resetting at the same time its own user_struct and all the associated
>>> accounting.
>>>
>>> The purpose of execns is to make sure that a process unsharing a namespace
>>> is free from any reference in the previous namespace. the execve() semantic
>>> seems to be the best candidate as it already flushes the previous process
>>> context.
>>>
>>> Thanks for reviewing, sharing, flaming !
>> 
>> 
>> I haven't had a chance to do a thorough review yet but why is
>> this needed?
>> 
>> What can be left shared by switching to a new namespace and then
>> execing an executable?
>> 
>> Is it not possible to ensure what you are trying to ensure with
>> a good user space executable?
>
> unshare() is unsafe for some namespaces because namespaces can reference
> each other. For the ipc namespace, example are shm ids vs. vma, sem ids vs.
> semundos, msq vs. netlink sockets. for the user namespace, open files. So
> it seems reasonable to provide a way to unshare namespaces from a clean
> process context.

It is perfectly legitimate to have a shared memory region memory mapped
from another namespace.  Yes sem ids versus semunds is an issue but it
just requires you to unshare one at the same time you unshare the other,
or to simply clone a new namespace.  I'm not familiar with the msq vs netlink
socket issue.  As for the user namespace vs open files.  If we have
any issues with open files in any namespace that sounds like an implementation
bug to me.

I'm not convinced the problems you are seeing are not implementation bugs.
For some things clone is still more general then unshare, and clone should
be considered the primary user interface, not unshare.

> Now, if you try to do that from user space, you will call unshare() then
> execve(), which leaves plenty of room and time for nasty things to happen
> in between the 2 calls.

I will look more closely but I think there is an important point being missed
somewhere.  Pieces of the kernel interact in all sorts of weird and unexpected
ways.  If we rely on ourselves always being in the right magic namespace for
things to work correctly we are setting ourselves up for trouble.  If we know
a namespace implementation will work even when a process has access to entities
in multiple instances of that namespace we are in much better shape.

Eric

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-12  3:24     ` Eric W. Biederman
@ 2006-07-12 13:05       ` Cedric Le Goater
  2006-07-12 16:56         ` Eric W. Biederman
  0 siblings, 1 reply; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-12 13:05 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain, Serge E. Hallyn, Dave Hansen

Hello !

Hopefully, we will soon see each other at OLS. We need some synchronous
interaction !

Eric W. Biederman wrote:

>>> Is it not possible to ensure what you are trying to ensure with
>>> a good user space executable?
>> unshare() is unsafe for some namespaces because namespaces can reference
>> each other. For the ipc namespace, example are shm ids vs. vma, sem ids vs.
>> semundos, msq vs. netlink sockets. for the user namespace, open files. So
>> it seems reasonable to provide a way to unshare namespaces from a clean
>> process context.
> 
> It is perfectly legitimate to have a shared memory region memory mapped
> from another namespace. 

then after unshare, a process can be in ipc namespace B with a shared
memory segment from ipc namespace A without any id for this segment. this
is not very consistent. the same process will also be able to modify the
ipc namespace B without being in this namespace. ugly. It looks like an
issue that should be solved.

I think namespace should enforce strict isolation. nop ?

> Yes sem ids versus semunds is an issue but it just requires you to unshare
> one at the same time you unshare the other, or to simply clone a new
> namespace.

hmm, semids the from ipc namespace are stored in task->sysv_sem. i would
forbid the unshare/clone in that case or flush the semundos like in
exit_sem(). but it's easier not to have any, like in a clean process image.

> I'm not familiar with the msq vs netlink socket issue.  

mq_notify can use a netlink socket to send an event back user space.

> As for the user namespace vs open files.  If we have any issues with open
> files in any namespace that sounds like an implementation bug to me.

user_struct does accounting on process, open files, locked memory, signals,
etc. if you unshare such an object, you will need to unshare all others
namespaces to be consistent. again having a clean process image is easier ...

> I'm not convinced the problems you are seeing are not implementation bugs.
> For some things clone is still more general then unshare, and clone should
> be considered the primary user interface, not unshare.

agree on that, i might be focusing a bit too much on the unshare syscall.
we should work on clone to make sure it has the required restrictions. The
system is really interlinked and not all namespaces can be unshared standalone.

>> Now, if you try to do that from user space, you will call unshare() then
>> execve(), which leaves plenty of room and time for nasty things to happen
>> in between the 2 calls.
> 
> I will look more closely but I think there is an important point being missed
> somewhere.  Pieces of the kernel interact in all sorts of weird and unexpected
> ways.  If we rely on ourselves always being in the right magic namespace for
> things to work correctly we are setting ourselves up for trouble.  If we know
> a namespace implementation will work even when a process has access to entities
> in multiple instances of that namespace we are in much better shape.

having a clean process image is IMHO required for some namespaces :

http://marc.theaimsgroup.com/?l=linux-kernel&m=113881171017330&w=2

thanks,

C.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-12 13:05       ` Cedric Le Goater
@ 2006-07-12 16:56         ` Eric W. Biederman
  2006-07-13 16:13           ` Cedric Le Goater
  0 siblings, 1 reply; 107+ messages in thread
From: Eric W. Biederman @ 2006-07-12 16:56 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain, Serge E. Hallyn, Dave Hansen

Cedric Le Goater <clg@fr.ibm.com> writes:

> Hello !
>
> Hopefully, we will soon see each other at OLS. We need some synchronous
> interaction !
>
> Eric W. Biederman wrote:
>
>>>> Is it not possible to ensure what you are trying to ensure with
>>>> a good user space executable?
>>> unshare() is unsafe for some namespaces because namespaces can reference
>>> each other. For the ipc namespace, example are shm ids vs. vma, sem ids vs.
>>> semundos, msq vs. netlink sockets. for the user namespace, open files. So
>>> it seems reasonable to provide a way to unshare namespaces from a clean
>>> process context.
>> 
>> It is perfectly legitimate to have a shared memory region memory mapped
>> from another namespace. 
>
> then after unshare, a process can be in ipc namespace B with a shared
> memory segment from ipc namespace A without any id for this segment. this
> is not very consistent. the same process will also be able to modify the
> ipc namespace B without being in this namespace. ugly. It looks like an
> issue that should be solved.
>
> I think namespace should enforce strict isolation. nop ?
>
>> Yes sem ids versus semunds is an issue but it just requires you to unshare
>> one at the same time you unshare the other, or to simply clone a new
>> namespace.
>
> hmm, semids the from ipc namespace are stored in task->sysv_sem. i would
> forbid the unshare/clone in that case or flush the semundos like in
> exit_sem(). but it's easier not to have any, like in a clean process image.
>
>> I'm not familiar with the msq vs netlink socket issue.  
>
> mq_notify can use a netlink socket to send an event back user space.

Ok.  That one is a mess, and I almost recall seeing that.  A big
chunk of that is a general netlink socket problem.  Getting enough
context in a netlink socket is a challenge because you can't use current.
I do think solving that is achievable though.  Just very peculiar.

>> As for the user namespace vs open files.  If we have any issues with open
>> files in any namespace that sounds like an implementation bug to me.
>
> user_struct does accounting on process, open files, locked memory, signals,
> etc. if you unshare such an object, you will need to unshare all others
> namespaces to be consistent. again having a clean process image is easier ...

I just don't see it.    The accounting is about objects and the namespaces
are about names of those objects.

>> I'm not convinced the problems you are seeing are not implementation bugs.
>> For some things clone is still more general then unshare, and clone should
>> be considered the primary user interface, not unshare.
>
> agree on that, i might be focusing a bit too much on the unshare syscall.
> we should work on clone to make sure it has the required restrictions. The
> system is really interlinked and not all namespaces can be unshared standalone.

I completely agree that there are pieces that interlink.

>>> Now, if you try to do that from user space, you will call unshare() then
>>> execve(), which leaves plenty of room and time for nasty things to happen
>>> in between the 2 calls.
>> 
>> I will look more closely but I think there is an important point being missed
>> somewhere.  Pieces of the kernel interact in all sorts of weird and unexpected
>> ways.  If we rely on ourselves always being in the right magic namespace for
>> things to work correctly we are setting ourselves up for trouble.  If we know
>> a namespace implementation will work even when a process has access to
> entities
>> in multiple instances of that namespace we are in much better shape.
>
> having a clean process image is IMHO required for some namespaces :
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113881171017330&w=2

That message is a terrible example.  Unless you are thinking of something
farther down that thread.  User space getting confused when it creates
a container is just an implementation of the container creation code.

Now I'm not certain what you mean by a clean process image, as there are always
left over pieces from the parent.  Clone creates a new task_struct.  exec replaces
the executable.  They both keep files open.


Eric


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-12 16:56         ` Eric W. Biederman
@ 2006-07-13 16:13           ` Cedric Le Goater
  0 siblings, 0 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-13 16:13 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, Andrew Morton, Kirill Korotaev, Andrey Savochkin,
	Herbert Poetzl, Sam Vilain, Serge E. Hallyn, Dave Hansen

Eric W. Biederman wrote:
>>
>> http://marc.theaimsgroup.com/?l=linux-kernel&m=113881171017330&w=2
> 
> That message is a terrible example.  Unless you are thinking of something
> farther down that thread.  User space getting confused when it creates
> a container is just an implementation of the container creation code.
>
> Now I'm not certain what you mean by a clean process image, as there are always
> left over pieces from the parent.  Clone creates a new task_struct.  exec replaces
> the executable.  They both keep files open.

next week, we need to work on that topic : how to assemble namespaces to
create a container. I'm sure this will cut down the email volume due to
confusion.

thanks,

C.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-11 20:22 ` Eric W. Biederman
  2006-07-11 21:28   ` Cedric Le Goater
@ 2006-07-12 11:11   ` Kirill Korotaev
  2006-07-12 13:10     ` Cedric Le Goater
  1 sibling, 1 reply; 107+ messages in thread
From: Kirill Korotaev @ 2006-07-12 11:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Cedric Le Goater, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

> I haven't had a chance to do a thorough review yet but why is
> this needed?
> 
> What can be left shared by switching to a new namespace and then
> execing an executable?
> 
> Is it not possible to ensure what you are trying to ensure with
> a good user space executable?

I agree with Eric. In OpenVZ we don't do exec(), because executable itself ensures
correct environment.

Do we need to overcomplicate kernel in this regard?

Thanks,
Kirill


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 0/7] execns syscall and user namespace
  2006-07-12 11:11   ` Kirill Korotaev
@ 2006-07-12 13:10     ` Cedric Le Goater
  0 siblings, 0 replies; 107+ messages in thread
From: Cedric Le Goater @ 2006-07-12 13:10 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Eric W. Biederman, linux-kernel, Andrew Morton, Kirill Korotaev,
	Andrey Savochkin, Herbert Poetzl, Sam Vilain, Serge E. Hallyn,
	Dave Hansen

Hello !

Kirill Korotaev wrote:
>> I haven't had a chance to do a thorough review yet but why is
>> this needed?
>>
>> What can be left shared by switching to a new namespace and then
>> execing an executable?
>>
>> Is it not possible to ensure what you are trying to ensure with
>> a good user space executable?
> 
> I agree with Eric. In OpenVZ we don't do exec(), because executable
> itself ensures correct environment.

Could briefly explain how the first process is started in a VPS ? Sorry for
being lazy and not looking at the code, but it would be interesting for all
to have some info.

> Do we need to overcomplicate kernel in this regard ?

I don't think it's an amazing kernel overkill. Just an extension to exec
with some flags to set up the environement in which the exec will be done.
there might another way to do it.

thanks,

C.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
@ 2006-07-15 17:39 Al Boldi
  2006-07-16 12:19 ` Kyle Moffett
  0 siblings, 1 reply; 107+ messages in thread
From: Al Boldi @ 2006-07-15 17:39 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel

Trond Myklebust wrote:
> On Sat, 2006-07-15 at 06:35 -0600, Eric W. Biederman wrote:
> > I hope the confusion has passed for Trond.  My impression was he
> > figured this was per process data so it didn't make sense any where
> > near a filesystem, and the superblock was the last place it should
> > be.
>
> You are still using the wrong abstraction. Data that is not global to
> the entire machine has absolutely _no_ place being put into the
> superblock. It doesn't matter if it is process-specific,
> container-specific or whatever-else-specific, it will still be vetoed.
>
> If your real problem is uid/gid mapping on top of generic filesystems,
> then have you looked into the *BSD solution of using a stackable
> filesystem (i.e. umapfs)?

A stackable FS is really overkill here, when all that is needed is a simple 
mapping.  An easy solution would be, to allow for perMount Handlers via 
hooks into the VFS, as was suggested in the '[RFC] VFS: FS CoW using 
redirection' thread.

Thanks!

--
Al


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-15 17:39 [PATCH -mm 5/7] add " Al Boldi
@ 2006-07-16 12:19 ` Kyle Moffett
  2006-07-17 11:25   ` Al Boldi
  0 siblings, 1 reply; 107+ messages in thread
From: Kyle Moffett @ 2006-07-16 12:19 UTC (permalink / raw)
  To: Al Boldi; +Cc: linux-fsdevel, linux-kernel

On Jul 15, 2006, at 13:39:50, Al Boldi wrote:
> Trond Myklebust wrote:
>> On Sat, 2006-07-15 at 06:35 -0600, Eric W. Biederman wrote:
>>> I hope the confusion has passed for Trond.  My impression was he  
>>> figured this was per process data so it didn't make sense any  
>>> where near a filesystem, and the superblock was the last place it  
>>> should be.
>>
>> You are still using the wrong abstraction. Data that is not global  
>> to the entire machine has absolutely _no_ place being put into the  
>> superblock. It doesn't matter if it is process-specific, container- 
>> specific or whatever-else-specific, it will still be vetoed.
>>
>> If your real problem is uid/gid mapping on top of generic  
>> filesystems, then have you looked into the *BSD solution of using  
>> a stackable filesystem (i.e. umapfs)?
>
> A stackable FS is really overkill here, when all that is needed is  
> a simple mapping.  An easy solution would be, to allow for perMount  
> Handlers via hooks into the VFS, as was suggested in the '[RFC]  
> VFS: FS CoW using redirection' thread.

IMHO a UID mapping is completely the wrong solution for this.  The  
problem is the the subject (the process) and object (the filesystem)  
place different meanings on different UIDs, in other words their UIDs  
are in different namespaces.  The result is that you should tag that  
filesystem (vfsmount, really) with a different namespace tag and fix  
the namespace system to properly handle cross-namespace permissions,  
not forcibly graft on some fragile mapping system.  By using the  
keyring system for foreign-namespace UID permissions the actual  
permissions fall out quite nicely.

Cheers,
Kyle Moffett


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH -mm 5/7] add user namespace
  2006-07-16 12:19 ` Kyle Moffett
@ 2006-07-17 11:25   ` Al Boldi
  0 siblings, 0 replies; 107+ messages in thread
From: Al Boldi @ 2006-07-17 11:25 UTC (permalink / raw)
  To: Kyle Moffett; +Cc: linux-fsdevel, linux-kernel

Kyle Moffett wrote:
> On Jul 15, 2006, at 13:39:50, Al Boldi wrote:
> > Trond Myklebust wrote:
> >> On Sat, 2006-07-15 at 06:35 -0600, Eric W. Biederman wrote:
> >>> I hope the confusion has passed for Trond.  My impression was he
> >>> figured this was per process data so it didn't make sense any
> >>> where near a filesystem, and the superblock was the last place it
> >>> should be.
> >>
> >> You are still using the wrong abstraction. Data that is not global
> >> to the entire machine has absolutely _no_ place being put into the
> >> superblock. It doesn't matter if it is process-specific, container-
> >> specific or whatever-else-specific, it will still be vetoed.
> >>
> >> If your real problem is uid/gid mapping on top of generic
> >> filesystems, then have you looked into the *BSD solution of using
> >> a stackable filesystem (i.e. umapfs)?
> >
> > A stackable FS is really overkill here, when all that is needed is
> > a simple mapping.  An easy solution would be, to allow for perMount
> > Handlers via hooks into the VFS, as was suggested in the '[RFC]
> > VFS: FS CoW using redirection' thread.
>
> IMHO a UID mapping is completely the wrong solution for this.

Sure, maybe it is, but if it were needed, why would I implement it using an 
extra VFS below the VFS?

> The problem is the subject (the process) and object (the filesystem)
> place different meanings on different UIDs, in other words their UIDs
> are in different namespaces.  The result is that you should tag that
> filesystem (vfsmount, really) with a different namespace tag and fix
> the namespace system to properly handle cross-namespace permissions,
> not forcibly graft on some fragile mapping system.

Grafting is probably the wrong way to implement this too, so what I suggest 
is to have the VFS provide hooks (aka plug-ins) for perMount Handlers, which 
would relieve developers from the need to either merge their requirements 
with the VFS, or introduce a subVFS.  This would imply more freedom, while 
at the same time reducing overhead.

> By using the
> keyring system for foreign-namespace UID permissions the actual
> permissions fall out quite nicely.

Maybe it is, for this specific situation.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 107+ messages in thread

end of thread, other threads:[~2006-07-17 11:24 UTC | newest]

Thread overview: 107+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-07-11  7:50 [PATCH -mm 0/7] execns syscall and user namespace Cedric Le Goater
2006-07-11  7:50 ` [PATCH -mm 1/7] add execns syscall core routine Cedric Le Goater
2006-07-11  7:50 ` [PATCH -mm 2/7] add execns syscall to s390 Cedric Le Goater
2006-07-11 13:44   ` Martin Schwidefsky
2006-07-11 13:44   ` Martin Schwidefsky
2006-07-11 14:44     ` Cedric Le Goater
2006-07-11 14:54       ` Martin Schwidefsky
2006-07-11 15:43         ` Cedric Le Goater
2006-07-11  7:50 ` [PATCH -mm 3/7] add execns syscall to x86_64 Cedric Le Goater
2006-07-11  7:50 ` [PATCH -mm 4/7] add execns syscall to i386 Cedric Le Goater
2006-07-11  7:50 ` [PATCH -mm 5/7] add user namespace Cedric Le Goater
2006-07-11 16:39   ` Kirill Korotaev
2006-07-11 17:38     ` Cedric Le Goater
2006-07-12 11:21       ` Kirill Korotaev
2006-07-13 16:01         ` Cedric Le Goater
2006-07-12  3:33     ` Eric W. Biederman
2006-07-12 11:13       ` Kirill Korotaev
2006-07-12 18:10         ` Eric W. Biederman
2006-07-13 17:00           ` Cedric Le Goater
2006-07-13 18:07             ` Eric W. Biederman
2006-07-13 18:21             ` Eric W. Biederman
2006-07-13 18:31               ` Dave Hansen
2006-07-13 18:54                 ` Eric W. Biederman
2006-07-12  3:46   ` Eric W. Biederman
2006-07-12 12:05     ` Herbert Poetzl
2006-07-12 17:09       ` Eric W. Biederman
2006-07-12 14:00     ` Cedric Le Goater
2006-07-12 17:24       ` Eric W. Biederman
2006-07-13 17:36         ` Cedric Le Goater
2006-07-13 17:47           ` Serge E. Hallyn
2006-07-13 18:14             ` Eric W. Biederman
2006-07-13 18:29               ` Dave Hansen
2006-07-13 19:02                 ` Eric W. Biederman
2006-07-13 20:03                   ` Dave Hansen
2006-07-14  3:45                     ` Eric W. Biederman
2006-07-14 14:28                       ` Dave Hansen
2006-07-14 15:13                         ` Eric W. Biederman
2006-07-14 16:29                           ` Serge E. Hallyn
2006-07-14 16:49                             ` Eric W. Biederman
2006-07-14 16:55                               ` Dave Hansen
2006-07-14 17:08                                 ` Serge E. Hallyn
2006-07-14 17:19                                   ` Dave Hansen
2006-07-14 17:36                                     ` Eric W. Biederman
2006-07-14 18:15                                       ` Trond Myklebust
2006-07-14 18:40                                         ` Eric W. Biederman
2006-07-14 21:04                                           ` Trond Myklebust
2006-07-15  4:09                                             ` Eric W. Biederman
2006-07-15  4:35                                               ` Kyle Moffett
2006-07-15 12:35                                                 ` Eric W. Biederman
2006-07-15 13:25                                                   ` Kyle Moffett
2006-07-15 15:54                                                   ` Dave Hansen
2006-07-15 17:01                                                   ` Trond Myklebust
2006-07-15 23:29                                                     ` Eric W. Biederman
2006-07-16 16:18                                                       ` Dave Hansen
2006-07-14 17:14                                 ` Eric W. Biederman
2006-07-16  8:36                                 ` Kirill Korotaev
2006-07-16 10:08                                   ` Eric W. Biederman
2006-07-14 17:05                               ` Serge E. Hallyn
2006-07-14 17:50                                 ` Kyle Moffett
2006-07-15 11:33                                   ` Serge E. Hallyn
2006-07-14 17:56                                 ` Eric W. Biederman
2006-07-14 16:35                           ` Dave Hansen
2006-07-13 21:41                   ` Serge E. Hallyn
2006-07-14  3:52                     ` Eric W. Biederman
2006-07-14 14:02                       ` Serge E. Hallyn
2006-07-14 14:50                         ` Eric W. Biederman
2006-07-14 16:39                           ` Serge E. Hallyn
2006-07-14 17:18                             ` Eric W. Biederman
2006-07-14 17:24                               ` Dave Hansen
2006-07-14 18:06                                 ` Eric W. Biederman
2006-07-14 18:42                                   ` Dave Hansen
2006-07-14 19:07                                     ` Eric W. Biederman
2006-07-13 17:59           ` Eric W. Biederman
2006-07-13 21:22             ` Serge E. Hallyn
2006-07-14  3:50               ` Eric W. Biederman
2006-07-14 14:17         ` Serge E. Hallyn
2006-07-14 15:05           ` Eric W. Biederman
2006-07-14 16:46             ` Serge E. Hallyn
2006-07-14 16:58               ` Eric W. Biederman
2006-07-14 15:43           ` Kyle Moffett
2006-07-14 16:13             ` Eric W. Biederman
2006-07-11  7:50 ` [PATCH -mm 6/7] add the user namespace to the execns syscall Cedric Le Goater
2006-07-11  7:50 ` [PATCH -mm 7/7] forbid the use of the unshare syscall on ipc namespaces Cedric Le Goater
2006-07-11 14:10   ` Kirill Korotaev
2006-07-11 15:06     ` Cedric Le Goater
2006-07-11  8:02 ` [PATCH -mm 0/7] execns syscall and user namespace Arjan van de Ven
2006-07-11  8:42   ` Cedric Le Goater
2006-07-11 18:12 ` H. Peter Anvin
2006-07-11 18:26   ` Cedric Le Goater
2006-07-11 18:28     ` H. Peter Anvin
2006-07-11 19:50       ` Ulrich Drepper
2006-07-11 21:50         ` Cedric Le Goater
2006-07-11 21:57           ` H. Peter Anvin
2006-07-12  0:16             ` Ulrich Drepper
2006-07-12  0:25               ` H. Peter Anvin
2006-07-12  0:28           ` H. Peter Anvin
2006-07-11 20:22 ` Eric W. Biederman
2006-07-11 21:28   ` Cedric Le Goater
2006-07-12  3:24     ` Eric W. Biederman
2006-07-12 13:05       ` Cedric Le Goater
2006-07-12 16:56         ` Eric W. Biederman
2006-07-13 16:13           ` Cedric Le Goater
2006-07-12 11:11   ` Kirill Korotaev
2006-07-12 13:10     ` Cedric Le Goater
  -- strict thread matches above, loose matches on Subject: below --
2006-07-15 17:39 [PATCH -mm 5/7] add " Al Boldi
2006-07-16 12:19 ` Kyle Moffett
2006-07-17 11:25   ` Al Boldi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox