All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/1] namespaces: introduce sys_hijack (v11)
@ 2008-07-31 18:32 Serge E. Hallyn
       [not found] ` <20080731183213.GA12033-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Serge E. Hallyn @ 2008-07-31 18:32 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Containers

Hi Pavel,

Here is the 'hijack' patch that was mentioned during the namespaces
part of the containers mini-summit.  It's a proposed way of entering
namespaces.

It's been rotting for awhile as you can see by the changelog, but
hopefully I updated it sufficiently and correctly.

-serge

From 9a7e1c11cd96435d0d27d28e4508f887d6dbf7ed Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Date: Thu, 10 Jul 2008 11:51:38 -0500
Subject: [PATCH 1/1] namespaces: introduce sys_hijack (v11)

Move most of do_fork() into a new do_fork_task() which acts on
a new argument, cgroup, rather than on current.  The original
process actually forks and if passed a non-NULL cgroup then
the new process's cgroup and namespaces are taken from the
target cgroup specified.  If passed a NULL cgroup, fork
behaves exactly as before, thus do_fork() becomes a call to
do_fork_task(NULL, ...).

Introduce sys_hijack (for i386 and s390 only so far).  An open
fd for a cgroup 'tasks' file is specified.  The main purpose
is to allow entering an empty cgroup without having to keep a
task alive in the target cgroup.  Only the cgroup and nsproxy
are copied from the cgroup.  Security, user, and rootfs info
is not retained in the cgroups and so cannot be copied to the
child task.

In order to hijack a cgroup, you must have CAP_SYS_ADMIN and
be entering a decendent of your current cgroup.

The effect is a sort of namespace enter.  The following program
uses sys_hijack to 'enter' all namespaces of the specified
cgroup. For instance in one terminal, do

	mount -t cgroup -ons cgroup /cgroup
	hostname
	  qemu
	ns_exec -u /bin/sh
	  hostname serge
          echo $$
            2996
	  cat /proc/$$/cgroup
	    ns:/node_2996

In another terminal then do

	hostname
	  qemu
	cat /proc/$$/cgroup
	  ns:/
	hijack /cgroup/node_2996/tasks
	  hostname
	    serge
	  cat /proc/$$/cgroup
	    ns:/node_2996

Changelog:
  Jul 31 2008:  Put fs_struct in ns_cgroup, and hijack it in
  		addition to the nsproxy.
  Jul 10 2008:  Port to recent -mm (cope with cgroup changes)
  Aug 23 2007:	send a stop signal to the hijacked process
		(like ptrace does).
  Oct 09 2007:	Update for 2.6.23-rc8-mm2 (mainly pidns)
		Don't take task_lock under rcu_read_lock
		Send hijacked process to cgroup_fork() as
		the first argument.
		Removed some unneeded task_locks.
  Oct 16 2007:	Fix bug introduced into alloc_pid.
  Oct 16 2007:	Add 'int which' argument to sys_hijack to
		allow later expansion to use cgroup in place
		of pid to specify what to hijack.
  Oct 24 2007:	Implement hijack by open cgroup file.
  Nov 02 2007:	Switch copying of task info: do full copy
		from current, then copy relevant pieces from
		hijacked task.
  Nov 06 2007:	Verbatim task_struct copy now comes from current,
		after which copy_hijackable_taskinfo() copies
		relevant context pieces from the hijack source.
  Nov 07 2007:	Move arch-independent hijack code to kernel/fork.c
  Nov 07 2007:	powerpc and x86_64 support (Mark Nelson)
  Nov 07 2007:	Don't allow hijacking members of same session.
  Nov 07 2007:	introduce cgroup_may_hijack, and may_hijack hook to
		cgroup subsystems.  The ns subsystem uses this to
		enforce the rule that one may only hijack descendent
		namespaces.
  Nov 07 2007:	s390 support
  Nov 08 2007:	don't send SIGSTOP to hijack source task
  Nov 10 2007:	cache reference to nsproxy in ns cgroup for use in

		hijacking an empty cgroup.
  Nov 10 2007:	allow partial hijack of empty cgroup
  Nov 13 2007:	don't double-get cgroup for hijack_ns
		find_css_set() actually returns the set with a
		reference already held, so cgroup_fork_fromcgroup()
		by doing a get_css_set() was getting a second
		reference.  Therefore after exiting the hijack
		task we could not rmdir the csgroup.
  Nov 22 2007:	temporarily remove x86_64 and powerpc support
  Nov 27 2007:	rebased on 2.6.24-rc3
  Jan 09 2008:	removed hijack pid and hijack cgroup options
  Jan 11 2008:	renamed cgroup_fork_fromcgroup() to be
		cgroup_fork_into_cgroup()

==============================================================
hijack.c
==============================================================
 #include <stdio.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <signal.h>
 #include <sys/wait.h>
 #include <stdlib.h>
 #include <unistd.h>

 #define __NR_hijack 333

/*
 *	hijack /cgroup/node_1078/tasks
 */

void usage(char *me)
{
	printf("Usage: %s <cgroup_tasks_file>\n", me);
	exit(1);
}

int exec_shell(void)
{
	execl("/bin/sh", "/bin/sh", NULL);
}

int main(int argc, char *argv[])
{
	int id;
	int ret;
	int status;

	if (argc < 2 || !strcmp(argv[1], "-h"))
		usage(argv[0]);

	id = open(argv[1], O_RDONLY);
	if (id == -1) {
		perror("cgroup open");
		return 1;
	}

	ret = syscall(__NR_hijack, SIGCHLD, (unsigned long)id);

	if  (ret == 0) {
		return exec_shell();
	} else if (ret < 0) {
		perror("sys_hijack");
	} else {
		printf("waiting on cloned process %d\n", ret);
		while(waitpid(-1, &status, __WALL) != -1)
				;
		printf("cloned process exited with %d (waitpid ret %d)\n",
				status, ret);
	}

	return ret;
}
==============================================================

Signed-off-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Mark Nelson <markn-8fk3Idey6ehBDgjK7y7TUQ@public.gmane.org>
---
 Documentation/cgroups.txt          |    8 +++
 arch/s390/kernel/process.c         |   10 +++
 arch/x86/kernel/process_32.c       |   10 +++
 arch/x86/kernel/syscall_table_32.S |    1 +
 include/asm-x86/unistd_32.h        |    1 +
 include/linux/cgroup.h             |   12 ++++
 include/linux/nsproxy.h            |   13 ++++-
 include/linux/sched.h              |    3 +
 include/linux/syscalls.h           |    2 +
 kernel/cgroup.c                    |   32 +++++++++++
 kernel/fork.c                      |   59 +++++++++++++++++---
 kernel/ns_cgroup.c                 |  108 +++++++++++++++++++++++++++++++++++-
 kernel/nsproxy.c                   |    2 +-
 13 files changed, 248 insertions(+), 13 deletions(-)

diff --git a/Documentation/cgroups.txt b/Documentation/cgroups.txt
index d9014aa..b7ba41e 100644
--- a/Documentation/cgroups.txt
+++ b/Documentation/cgroups.txt
@@ -502,6 +502,14 @@ void attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
 Called after the task has been attached to the cgroup, to allow any
 post-attachment activity that requires memory allocations or blocking.
 
+int may_hijack(struct cgroup_subsys *ss, struct cgroup *new_cgroup,
+	       struct task_struct *task)
+
+Called prior to hijacking a cgroup.  Current is cloning a new child
+which is hijacking the cgroup and namespace from the target cgroup.
+Security context is kept from the original (forking) process.
+Return 0 to allow.
+
 void fork(struct cgroup_subsy *ss, struct task_struct *task)
 
 Called when a task is forked into a cgroup.
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 9839767..3b64077 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -278,6 +278,16 @@ asmlinkage long sys_clone(void)
 		       parent_tidptr, child_tidptr);
 }
 
+asmlinkage long sys_hijack(void)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long sp = regs->orig_gpr2;
+	unsigned long clone_flags = regs->gprs[3];
+	unsigned int fd = regs->gprs[4];
+
+	return hijack_ns(fd, clone_flags, *regs, sp);
+}
+
 /*
  * This is trivial, and on the face of it looks like it
  * could equally well be done in user mode.
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 53bc653..4711eed 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -36,6 +36,7 @@
 #include <linux/personality.h>
 #include <linux/tick.h>
 #include <linux/percpu.h>
+#include <linux/cgroup.h>
 #include <linux/prctl.h>
 
 #include <asm/uaccess.h>
@@ -648,6 +649,15 @@ asmlinkage int sys_clone(struct pt_regs regs)
 	return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
 }
 
+asmlinkage int sys_hijack(struct pt_regs regs)
+{
+	unsigned long sp = regs.sp;
+	unsigned long clone_flags = regs.bx;
+	unsigned int fd = regs.cx;
+
+	return hijack_ns(fd, clone_flags, regs, sp);
+}
+
 /*
  * This is trivial, and on the face of it looks like it
  * could equally well be done in user mode.
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..fd9d4f4 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,4 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_hijack
diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
index d739467..70280da 100644
--- a/include/asm-x86/unistd_32.h
+++ b/include/asm-x86/unistd_32.h
@@ -338,6 +338,7 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_hijack		333
 
 #ifdef __KERNEL__
 
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c98dd7c..ca6a439 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -30,6 +30,8 @@ extern void cgroup_lock(void);
 extern bool cgroup_lock_live_group(struct cgroup *cgrp);
 extern void cgroup_unlock(void);
 extern void cgroup_fork(struct task_struct *p);
+extern void cgroup_fork_into_cgroup(struct cgroup *new_cg,
+					struct task_struct *child);
 extern void cgroup_fork_callbacks(struct task_struct *p);
 extern void cgroup_post_fork(struct task_struct *p);
 extern void cgroup_exit(struct task_struct *p, int run_callbacks);
@@ -313,6 +315,8 @@ struct cgroup_subsys {
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss,
 			  struct cgroup *cgrp, struct task_struct *tsk);
+	int (*may_hijack)(struct cgroup_subsys *ss,
+			  struct cgroup *cont, struct task_struct *tsk);
 	void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
 			struct cgroup *old_cgrp, struct task_struct *tsk);
 	void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
@@ -393,12 +397,19 @@ void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
 
+struct cgroup *cgroup_from_fd(unsigned int fd);
 #else /* !CONFIG_CGROUPS */
+struct cgroup {
+};
 
 static inline int cgroup_init_early(void) { return 0; }
 static inline int cgroup_init(void) { return 0; }
 static inline void cgroup_init_smp(void) {}
 static inline void cgroup_fork(struct task_struct *p) {}
+
+static inline void cgroup_fork_into_cgroup(struct cgroup *new_cg,
+					struct task_struct *child) {}
+
 static inline void cgroup_fork_callbacks(struct task_struct *p) {}
 static inline void cgroup_post_fork(struct task_struct *p) {}
 static inline void cgroup_exit(struct task_struct *p, int callbacks) {}
@@ -411,6 +422,7 @@ static inline int cgroupstats_build(struct cgroupstats *stats,
 	return -EINVAL;
 }
 
+static inline struct cgroup *cgroup_from_fd(unsigned int fd) { return NULL; }
 #endif /* !CONFIG_CGROUPS */
 
 #ifdef CONFIG_MM_OWNER
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index c8a768e..d5d6def 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -3,6 +3,7 @@
 
 #include <linux/spinlock.h>
 #include <linux/sched.h>
+#include <linux/err.h>
 
 struct mnt_namespace;
 struct uts_namespace;
@@ -81,13 +82,21 @@ static inline void get_nsproxy(struct nsproxy *ns)
 	atomic_inc(&ns->count);
 }
 
+struct cgroup;
 #ifdef CONFIG_CGROUP_NS
-int ns_cgroup_clone(struct task_struct *tsk, struct pid *pid);
+int ns_cgroup_clone(struct task_struct *tsk, struct pid *pid,
+			struct nsproxy *nsproxy);
+int ns_cgroup_verify(struct cgroup *cgroup);
+void copy_hijack_nsproxy(struct task_struct *tsk, struct cgroup *cgroup);
 #else
-static inline int ns_cgroup_clone(struct task_struct *tsk, struct pid *pid)
+static inline int ns_cgroup_clone(struct task_struct *tsk, struct pid *pid,
+	struct nsproxy *nsproxy)
 {
 	return 0;
 }
+static inline int ns_cgroup_verify(struct cgroup *cgroup) { return 0; }
+static inline void copy_hijack_nsproxy(struct task_struct *tsk,
+				       struct cgroup *cgroup) {}
 #endif
 
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5270d44..f12c891 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1871,6 +1871,9 @@ extern int do_execve(char *, char __user * __user *, char __user * __user *, str
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 
+extern int hijack_ns(unsigned int fd, unsigned long clone_flags,
+		struct pt_regs regs, unsigned long sp);
+
 extern void set_task_comm(struct task_struct *tsk, char *from);
 extern char *get_task_comm(char *to, struct task_struct *tsk);
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d6ff145..8a031c8 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -625,4 +625,6 @@ asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
+asmlinkage long sys_hijack(unsigned long flags, unsigned long fd);
+
 #endif
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 13932ab..7108626 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -46,6 +46,7 @@
 #include <linux/cgroupstats.h>
 #include <linux/hash.h>
 #include <linux/namei.h>
+#include <linux/file.h>
 
 #include <asm/atomic.h>
 
@@ -2707,6 +2708,14 @@ void cgroup_fork(struct task_struct *child)
 	INIT_LIST_HEAD(&child->cg_list);
 }
 
+void cgroup_fork_into_cgroup(struct cgroup *new_cg, struct task_struct *child)
+{
+	mutex_lock(&cgroup_mutex);
+	child->cgroups = find_css_set(child->cgroups, new_cg);
+	INIT_LIST_HEAD(&child->cg_list);
+	mutex_unlock(&cgroup_mutex);
+}
+
 /**
  * cgroup_fork_callbacks - run fork callbacks
  * @child: the new task
@@ -3095,6 +3104,29 @@ static void cgroup_release_agent(struct work_struct *work)
 	mutex_unlock(&cgroup_mutex);
 }
 
+struct cgroup *cgroup_from_fd(unsigned int fd)
+{
+	struct file *file;
+	struct cgroup *cgroup = NULL;;
+
+	file = fget(fd);
+	if (!file)
+		return NULL;
+
+	if (!file->f_dentry || !file->f_dentry->d_sb)
+		goto out_fput;
+	if (file->f_dentry->d_parent->d_sb->s_magic != CGROUP_SUPER_MAGIC)
+		goto out_fput;
+	if (strcmp(file->f_dentry->d_name.name, "tasks"))
+		goto out_fput;
+
+	cgroup = __d_cgrp(file->f_dentry->d_parent);
+
+out_fput:
+	fput(file);
+	return cgroup;
+}
+
 static int __init cgroup_disable(char *str)
 {
 	int i;
diff --git a/kernel/fork.c b/kernel/fork.c
index 7ce2ebe..52b5037 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -674,11 +674,15 @@ EXPORT_SYMBOL_GPL(copy_fs_struct);
 
 static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
 {
+	if (!tsk->fs) {
+		printk(KERN_NOTICE "%s: tsk didn't have fs\n", __func__);
+		tsk->fs = current->fs;
+	}
 	if (clone_flags & CLONE_FS) {
-		atomic_inc(&current->fs->count);
+		atomic_inc(&tsk->fs->count);
 		return 0;
 	}
-	tsk->fs = __copy_fs_struct(current->fs);
+	tsk->fs = __copy_fs_struct(tsk->fs);
 	if (!tsk->fs)
 		return -ENOMEM;
 	return 0;
@@ -893,7 +897,8 @@ void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
  * parts of the process environment (as per the clone
  * flags). The actual kick-off is left to the caller.
  */
-static struct task_struct *copy_process(unsigned long clone_flags,
+static struct task_struct *copy_process(struct cgroup *cgroup,
+					unsigned long clone_flags,
 					unsigned long stack_start,
 					struct pt_regs *regs,
 					unsigned long stack_size,
@@ -931,6 +936,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p = dup_task_struct(current);
 	if (!p)
 		goto fork_out;
+	if (cgroup)
+		copy_hijack_nsproxy(p, cgroup);
 
 	rt_mutex_init_task(p);
 
@@ -1012,7 +1019,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->cap_bset = current->cap_bset;
 	p->io_context = NULL;
 	p->audit_context = NULL;
-	cgroup_fork(p);
+	if (cgroup)
+		cgroup_fork_into_cgroup(cgroup, p);
+	else
+		cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
  	if (IS_ERR(p->mempolicy)) {
@@ -1100,7 +1110,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		p->tgid = current->tgid;
 
 	if (current->nsproxy != p->nsproxy) {
-		retval = ns_cgroup_clone(p, pid);
+		retval = ns_cgroup_clone(p, pid, p->nsproxy);
 		if (retval)
 			goto bad_fork_free_pid;
 	}
@@ -1304,7 +1314,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct task_struct *task;
 	struct pt_regs regs;
 
-	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
+	task = copy_process(NULL, CLONE_VM, 0, idle_regs(&regs), 0, NULL,
 			    &init_struct_pid, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
@@ -1318,7 +1328,8 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_task(struct cgroup *cgroup,
+	      unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
@@ -1352,7 +1363,7 @@ long do_fork(unsigned long clone_flags,
 	if (likely(user_mode(regs)))
 		trace = tracehook_prepare_clone(clone_flags);
 
-	p = copy_process(clone_flags, stack_start, regs, stack_size,
+	p = copy_process(cgroup, clone_flags, stack_start, regs, stack_size,
 			 child_tidptr, NULL, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
@@ -1407,6 +1418,38 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+/*
+ *  Ok, this is the main fork-routine.
+ *
+ * It copies the process, and if successful kick-starts
+ * it and waits for it to finish using the VM if required.
+ */
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_task(NULL, clone_flags, stack_start,
+		regs, stack_size, parent_tidptr, child_tidptr);
+}
+
+int hijack_ns(unsigned int fd, unsigned long clone_flags,
+		  struct pt_regs regs, unsigned long sp)
+{
+	struct cgroup *cgroup;
+
+	cgroup = cgroup_from_fd(fd);
+	if (!cgroup)
+		return -EINVAL;
+
+	if (!ns_cgroup_verify(cgroup))
+		return -EINVAL;
+
+	return do_fork_task(cgroup, clone_flags, sp, &regs, 0, NULL, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
diff --git a/kernel/ns_cgroup.c b/kernel/ns_cgroup.c
index 43c2111..1ab8699 100644
--- a/kernel/ns_cgroup.c
+++ b/kernel/ns_cgroup.c
@@ -10,9 +10,12 @@
 #include <linux/proc_fs.h>
 #include <linux/slab.h>
 #include <linux/nsproxy.h>
+#include <linux/fs_struct.h>
 
 struct ns_cgroup {
 	struct cgroup_subsys_state css;
+	struct nsproxy *nsproxy;
+	struct fs_struct *fs;
 	spinlock_t lock;
 };
 
@@ -25,12 +28,67 @@ static inline struct ns_cgroup *cgroup_to_ns(
 			    struct ns_cgroup, css);
 }
 
-int ns_cgroup_clone(struct task_struct *task, struct pid *pid)
+int ns_cgroup_clone(struct task_struct *task, struct pid *pid,
+				struct nsproxy *nsproxy)
 {
 	char name[PROC_NUMBUF];
+	struct cgroup *cgroup;
+	struct ns_cgroup *ns_cgroup;
+	int ret;
 
 	snprintf(name, PROC_NUMBUF, "%d", pid_vnr(pid));
-	return cgroup_clone(task, &ns_subsys, name);
+
+	ret = cgroup_clone(task, &ns_subsys, name);
+
+	if (ret)
+		return ret;
+
+	cgroup = task_cgroup(task, ns_subsys_id);
+	ns_cgroup = cgroup_to_ns(cgroup);
+	ns_cgroup->nsproxy = nsproxy;
+	ns_cgroup->fs = task->fs;
+	atomic_inc(&task->fs->count);
+	get_nsproxy(nsproxy);
+
+	return 0;
+}
+
+/*
+ * Verify that the cgroup contains a valid ns_cgroup (which can
+ * be entered), in case a different cgroup fd is passed in, for
+ * example a cpuset cgroup.
+ */
+int ns_cgroup_verify(struct cgroup *cgroup)
+{
+	struct cgroup_subsys_state *css;
+	struct ns_cgroup *ns_cgroup;
+
+	css = cgroup_subsys_state(cgroup, ns_subsys_id);
+	if (!css)
+		return 0;
+	ns_cgroup = container_of(css, struct ns_cgroup, css);
+	if (!ns_cgroup->nsproxy)
+		return 0;
+	return 1;
+}
+
+/*
+ * this shouldn't be called unless ns_cgroup_verify() has
+ * confirmed that there is a ns_cgroup in this cgroup
+ * (which is done in hijack_ns() at the moment)
+ *
+ * tsk is not yet running, and has not yet taken a reference
+ * to it's previous ->nsproxy, so we just do a simple assignment
+ * rather than switch_task_namespaces()
+ * Same with fs
+ */
+void copy_hijack_nsproxy(struct task_struct *tsk, struct cgroup *cgroup)
+{
+	struct ns_cgroup *ns_cgroup;
+
+	ns_cgroup = cgroup_to_ns(cgroup);
+	tsk->nsproxy = ns_cgroup->nsproxy;
+	tsk->fs = ns_cgroup->fs;
 }
 
 /*
@@ -66,6 +124,46 @@ static int ns_can_attach(struct cgroup_subsys *ss,
 	return 0;
 }
 
+static void ns_attach(struct cgroup_subsys *ss,
+			  struct cgroup *cgroup, struct cgroup *oldcgroup,
+			  struct task_struct *tsk)
+{
+	struct ns_cgroup *ns_cgroup = cgroup_to_ns(cgroup);
+
+	if (likely(ns_cgroup->nsproxy))
+		return;
+
+	spin_lock(&ns_cgroup->lock);
+	if (!ns_cgroup->nsproxy) {
+		ns_cgroup->nsproxy = tsk->nsproxy;
+		get_nsproxy(ns_cgroup->nsproxy);
+	}
+	if (!ns_cgroup->fs) {
+		ns_cgroup->fs = tsk->fs;
+		atomic_inc(&ns_cgroup->fs->count);
+	}
+	spin_unlock(&ns_cgroup->lock);
+}
+
+/*
+ * only allow hijacking child namespaces
+ * Q: is it crucial to prevent hijacking a task in your same cgroup?
+ */
+static int ns_may_hijack(struct cgroup_subsys *ss,
+		struct cgroup *new_cgroup, struct task_struct *task)
+{
+	if (current == task)
+		return -EINVAL;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (!cgroup_is_descendant(new_cgroup))
+		return -EPERM;
+
+	return 0;
+}
+
 /*
  * Rules: you can only create a cgroup if
  *     1. you are capable(CAP_SYS_ADMIN)
@@ -94,12 +192,18 @@ static void ns_destroy(struct cgroup_subsys *ss,
 	struct ns_cgroup *ns_cgroup;
 
 	ns_cgroup = cgroup_to_ns(cgroup);
+	if (ns_cgroup->nsproxy)
+		put_nsproxy(ns_cgroup->nsproxy);
+	if (ns_cgroup->fs)
+		put_fs_struct(ns_cgroup->fs);
 	kfree(ns_cgroup);
 }
 
 struct cgroup_subsys ns_subsys = {
 	.name = "ns",
 	.can_attach = ns_can_attach,
+	.attach = ns_attach,
+	.may_hijack = ns_may_hijack,
 	.create = ns_create,
 	.destroy  = ns_destroy,
 	.subsys_id = ns_subsys_id,
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 21575fc..70b8c7f 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -203,7 +203,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
 		goto out;
 	}
 
-	err = ns_cgroup_clone(current, task_pid(current));
+	err = ns_cgroup_clone(current, task_pid(current), *new_nsp);
 	if (err)
 		put_nsproxy(*new_nsp);
 
-- 
1.5.3.6

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] namespaces: introduce sys_hijack (v11)
       [not found] ` <20080731183213.GA12033-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-08-01  8:30   ` KOSAKI Motohiro
       [not found]     ` <20080801172811.FEC3.KOSAKI.MOTOHIRO-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  2008-08-01  9:23   ` Bastian Blank
  1 sibling, 1 reply; 12+ messages in thread
From: KOSAKI Motohiro @ 2008-08-01  8:30 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers, Pavel Emelyanov

Hi

fork() is very important for system performance.
I worry about performance regression if this feature isn't used.

Could you mesure spawn benchmark in unixbench?



> Hi Pavel,
> 
> Here is the 'hijack' patch that was mentioned during the namespaces
> part of the containers mini-summit.  It's a proposed way of entering
> namespaces.
> 
> It's been rotting for awhile as you can see by the changelog, but
> hopefully I updated it sufficiently and correctly.
> 
> -serge

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] namespaces: introduce sys_hijack (v11)
       [not found] ` <20080731183213.GA12033-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-08-01  8:30   ` KOSAKI Motohiro
@ 2008-08-01  9:23   ` Bastian Blank
       [not found]     ` <20080801092318.GA2002-0IJIQSrh9RL9UF0aPl6fsj8Kkb2uy4ct@public.gmane.org>
  1 sibling, 1 reply; 12+ messages in thread
From: Bastian Blank @ 2008-08-01  9:23 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers, Pavel Emelyanov

On Thu, Jul 31, 2008 at 01:32:13PM -0500, Serge E. Hallyn wrote:
> Introduce sys_hijack (for i386 and s390 only so far).  An open
> fd for a cgroup 'tasks' file is specified.  The main purpose
> is to allow entering an empty cgroup without having to keep a
> task alive in the target cgroup.

What is the problem if no task is alive in the target?

> The effect is a sort of namespace enter.  The following program
> uses sys_hijack to 'enter' all namespaces of the specified
> cgroup.

I currently fail to see what the differences to a normal cgroup attach
is.

>         For instance in one terminal, do
> 
> 	mount -t cgroup -ons cgroup /cgroup
> 	hostname
> 	  qemu
> 	ns_exec -u /bin/sh
> 	  hostname serge
>           echo $$
>             2996
> 	  cat /proc/$$/cgroup
> 	    ns:/node_2996
> 
> In another terminal then do
> 
> 	hostname
> 	  qemu
> 	cat /proc/$$/cgroup
> 	  ns:/
> 	hijack /cgroup/node_2996/tasks

Why can't this be done by a echo $$ >> /cgroup/node_2996/attach?

> 	  hostname
> 	    serge
> 	  cat /proc/$$/cgroup
> 	    ns:/node_2996

Bastian

-- 
Star Trek Lives!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] namespaces: introduce sys_hijack (v11)
       [not found]     ` <20080801092318.GA2002-0IJIQSrh9RL9UF0aPl6fsj8Kkb2uy4ct@public.gmane.org>
@ 2008-08-01 14:11       ` Serge E. Hallyn
       [not found]         ` <20080801141152.GA11553-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Serge E. Hallyn @ 2008-08-01 14:11 UTC (permalink / raw)
  To: Bastian Blank; +Cc: Linux Containers, Pavel Emelyanov

Quoting Bastian Blank (bastian-yyjItF7Rl6lg9hUCZPvPmw@public.gmane.org):
> On Thu, Jul 31, 2008 at 01:32:13PM -0500, Serge E. Hallyn wrote:
> > Introduce sys_hijack (for i386 and s390 only so far).  An open
> > fd for a cgroup 'tasks' file is specified.  The main purpose
> > is to allow entering an empty cgroup without having to keep a
> > task alive in the target cgroup.
> 
> What is the problem if no task is alive in the target?

Oh, that comment dates back to when I first introduced the
attach-by-ns_cgroup feature.  Before that one had to specify a process
id of an existing task, resulting in hijacking that task.

Eventually, we dropped the hijack_by_pid entirely.

> > The effect is a sort of namespace enter.  The following program
> > uses sys_hijack to 'enter' all namespaces of the specified
> > cgroup.
> 
> I currently fail to see what the differences to a normal cgroup attach
> is.

A normal cgroup attach doesn't switch a task's root and nsproxies.

> >         For instance in one terminal, do
> > 
> > 	mount -t cgroup -ons cgroup /cgroup
> > 	hostname
> > 	  qemu
> > 	ns_exec -u /bin/sh
> > 	  hostname serge
> >           echo $$
> >             2996
> > 	  cat /proc/$$/cgroup
> > 	    ns:/node_2996
> > 
> > In another terminal then do
> > 
> > 	hostname
> > 	  qemu
> > 	cat /proc/$$/cgroup
> > 	  ns:/
> > 	hijack /cgroup/node_2996/tasks
> 
> Why can't this be done by a echo $$ >> /cgroup/node_2996/attach?

Do you mean "why does that current functionality not suffice", or "why
didn't you implement the feature with those semantics"?

Current functionality doesn't suffice because namespaces and
fs_struct are not switched with cgroup attach.  Cgroup attach is
just about tracking tasks, and keeping stats and enforcing limits or
guarantees on the groups.

The problem with implementing this feature using the attach
semantics is that it would move an existing task into the new
cgroup.  That would get much more complicated, especially when
you consider pid namespaces, where we explicitly refuse to
unshare for the same reason.

That is why, with hijack, we clone a new task which is started
afresh in the new namespaces.

thanks,
-serge

> > 	  hostname
> > 	    serge
> > 	  cat /proc/$$/cgroup
> > 	    ns:/node_2996
> 
> Bastian
> 
> -- 
> Star Trek Lives!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] namespaces: introduce sys_hijack (v11)
       [not found]     ` <20080801172811.FEC3.KOSAKI.MOTOHIRO-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2008-08-01 14:22       ` Serge E. Hallyn
  2008-08-07 19:23       ` Serge E. Hallyn
  1 sibling, 0 replies; 12+ messages in thread
From: Serge E. Hallyn @ 2008-08-01 14:22 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Linux Containers, Pavel Emelyanov

Quoting KOSAKI Motohiro (kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org):
> Hi
> 
> fork() is very important for system performance.
> I worry about performance regression if this feature isn't used.
> 
> Could you mesure spawn benchmark in unixbench?

Yes, I will do that.  I'll send the results next week.

Some context and history about this patch:

	1. Originally the 'enter a namespace' functionality was
	   implemented (in vserver and in -lxc, maybe also in
	   openvz) by specifying an integer id for a container
	   to enter.  See for instance the patches under
	   http://lxc.sourceforge.net/patches/2.6.25/2.6.25-rc8-mm2-lxc1/broken-out/bind_ns/

	2. At one point I started extending the ns_cgroup to
	   allow namespace entering by attaching to an ns_cgroup.
	   So 'echo $$ > /containers/5487/tasks' would switch
	   your nsproxy.  People didn't like that much in part
	   because switching the namespaces while a task is
	   executing seems frought with potential dangers,
	   though those remain unexplored.

	3. Eric suggested that namespace entering should just
	   be done by ptracing a process in a container and
	   bending it to our will.  The hijack patchset was
	   sort of a response to that.  Strictly using ptrace
	   seemed to me to leave to leave us too much at the
	   mercy of the target's context.
	   For instance, if using selinux, when you ptrace a
	   target you are subject to the target's permissions.
	   With hijack, you'll continue with your own,
	   presumably administrator-level, privilege level.

	4. We stopped posting this patch some time ago because
	   it seems worth seeing just how far we can go in
	   terms of analyzing and configuring containers by
	   properly setting up namespaces (i.e. hierarchical
	   pid and user namespaces) and using filesystems
	   (in conjunction with mounts propagation) to do the
	   configuration.
	   But at the mini-summit last week it was mentioned
	   that OpenVZ still wants namespace entering, and
	   there seemed to be no real opposition, so we told
	   Pavel we would repost this patchset.

thanks,
-serge

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] namespaces: introduce sys_hijack (v11)
       [not found]         ` <20080801141152.GA11553-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-08-01 15:51           ` Bastian Blank
       [not found]             ` <20080801155148.GA16760-0IJIQSrh9RL9UF0aPl6fsj8Kkb2uy4ct@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Bastian Blank @ 2008-08-01 15:51 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Linux Containers, Pavel Emelyanov

On Fri, Aug 01, 2008 at 09:11:53AM -0500, Serge E. Hallyn wrote:
> Quoting Bastian Blank (bastian-yyjItF7Rl6lg9hUCZPvPmw@public.gmane.org):
> > On Thu, Jul 31, 2008 at 01:32:13PM -0500, Serge E. Hallyn wrote:
> > > The effect is a sort of namespace enter.  The following program
> > > uses sys_hijack to 'enter' all namespaces of the specified
> > > cgroup.
> > 
> > I currently fail to see what the differences to a normal cgroup attach
> > is.
> 
> A normal cgroup attach doesn't switch a task's root and nsproxies.

> Current functionality doesn't suffice because namespaces and
> fs_struct are not switched with cgroup attach.  Cgroup attach is
> just about tracking tasks, and keeping stats and enforcing limits or
> guarantees on the groups.

If you apply a nsproxy to a cgroup, it is part of its limits.

> The problem with implementing this feature using the attach
> semantics is that it would move an existing task into the new
> cgroup.  That would get much more complicated, especially when
> you consider pid namespaces, where we explicitly refuse to
> unshare for the same reason.

Okay, this is a reason. But I think it should disallow attach after the
nsproxy is set, otherwise you can use attach and hijack for the same
cgroup and produce different behaviour. The description of the
can_attach method does not mention such a test, but it seems to do one.

Why is it not enough to use the pid of the ns creator? The ns cgroups
are created including the pid in the name. And it would avoid using that
weird interface with fd of a cgroups file.

> That is why, with hijack, we clone a new task which is started
> afresh in the new namespaces.

Why did you name it "hijack"? If I had not read the mail, I'd no idea
what this is about. It does not take away the information from something
else, it overrides the information (nsprox, fs) on the new task.

But I think I have a different problem. Currently, namespaces are
destructed if the last process using them exits. You change that, they
will survive until the cgroup dies. Or is that cgroup destructed when
there are no longer processes using the nsproxy? As the commit message
speaks about "pid wraparound" as problem, I doubt that.

Bastian

-- 
To live is always desirable.
		-- Eleen the Capellan, "Friday's Child", stardate 3498.9

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] namespaces: introduce sys_hijack (v11)
       [not found]             ` <20080801155148.GA16760-0IJIQSrh9RL9UF0aPl6fsj8Kkb2uy4ct@public.gmane.org>
@ 2008-08-01 16:39               ` Serge E. Hallyn
       [not found]                 ` <20080801163905.GA4647-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Serge E. Hallyn @ 2008-08-01 16:39 UTC (permalink / raw)
  To: Bastian Blank; +Cc: Linux Containers, Pavel Emelyanov

Quoting Bastian Blank (bastian-yyjItF7Rl6lg9hUCZPvPmw@public.gmane.org):
> On Fri, Aug 01, 2008 at 09:11:53AM -0500, Serge E. Hallyn wrote:
> > Quoting Bastian Blank (bastian-yyjItF7Rl6lg9hUCZPvPmw@public.gmane.org):
> > > On Thu, Jul 31, 2008 at 01:32:13PM -0500, Serge E. Hallyn wrote:
> > > > The effect is a sort of namespace enter.  The following program
> > > > uses sys_hijack to 'enter' all namespaces of the specified
> > > > cgroup.
> > > 
> > > I currently fail to see what the differences to a normal cgroup attach
> > > is.
> > 
> > A normal cgroup attach doesn't switch a task's root and nsproxies.
> 
> > Current functionality doesn't suffice because namespaces and
> > fs_struct are not switched with cgroup attach.  Cgroup attach is
> > just about tracking tasks, and keeping stats and enforcing limits or
> > guarantees on the groups.
> 
> If you apply a nsproxy to a cgroup, it is part of its limits.
> 
> > The problem with implementing this feature using the attach
> > semantics is that it would move an existing task into the new
> > cgroup.  That would get much more complicated, especially when
> > you consider pid namespaces, where we explicitly refuse to
> > unshare for the same reason.
> 
> Okay, this is a reason. But I think it should disallow attach after the
> nsproxy is set, otherwise you can use attach and hijack for the same
> cgroup and produce different behaviour. The description of the
> can_attach method does not mention such a test, but it seems to do one.

Hmm, the description should mention that.  Yes, you can onlly attach
to an empty cgroup.  Obviously that behavior was changed in the
patchset which implemented namespace entering through attach, but
that patchset was tossed.

> Why is it not enough to use the pid of the ns creator? The ns cgroups

pids wrap around

> are created including the pid in the name. And it would avoid using that
> weird interface with fd of a cgroups file.
> 
> > That is why, with hijack, we clone a new task which is started
> > afresh in the new namespaces.
> 
> Why did you name it "hijack"? If I had not read the mail, I'd no idea
> what this is about. It does not take away the information from something
> else, it overrides the information (nsprox, fs) on the new task.

We can call it whatever we want, but originally it amounted to hijacking
the namespace info from an existing task so it seemed as good a name as
any for an RFC.

> But I think I have a different problem. Currently, namespaces are
> destructed if the last process using them exits. You change that, they
> will survive until the cgroup dies. Or is that cgroup destructed when
> there are no longer processes using the nsproxy? As the commit message
> speaks about "pid wraparound" as problem, I doubt that.

Correct.  Having the namespaces stick around, and being able to attach
to an empty container, was something Paul Menage had wanted IIRC.

Hmm, and now that you mention it, I notice that actually the /proc for
a container doesn't behave right after all tasks in a container exit.

But I'll leave that as is for now, until I hear something other than
"this is so wrong it isn't funny" from Pavel :)

-serge

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] namespaces: introduce sys_hijack (v11)
       [not found]                 ` <20080801163905.GA4647-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-08-01 17:19                   ` Bastian Blank
       [not found]                     ` <20080801171951.GA23754-0IJIQSrh9RL9UF0aPl6fsj8Kkb2uy4ct@public.gmane.org>
  2008-08-01 17:22                   ` Bastian Blank
  1 sibling, 1 reply; 12+ messages in thread
From: Bastian Blank @ 2008-08-01 17:19 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Fri, Aug 01, 2008 at 11:39:05AM -0500, Serge E. Hallyn wrote:
> Quoting Bastian Blank (bastian-yyjItF7Rl6lg9hUCZPvPmw@public.gmane.org):
> > Why is it not enough to use the pid of the ns creator? The ns cgroups
> 
> pids wrap around

Ups, yes.

> > But I think I have a different problem. Currently, namespaces are
> > destructed if the last process using them exits. You change that, they
> > will survive until the cgroup dies. Or is that cgroup destructed when
> > there are no longer processes using the nsproxy? As the commit message
> > speaks about "pid wraparound" as problem, I doubt that.
> 
> Correct.  Having the namespaces stick around, and being able to attach
> to an empty container, was something Paul Menage had wanted IIRC.

It may produce problems with pid namespaces. The namespace is cleared if
the child reaper dies and I'm not sure how well it behaves without a new
one, which you can't create.

> But I'll leave that as is for now, until I hear something other than
> "this is so wrong it isn't funny" from Pavel :)

I'm not sure if it is funny to add another piece which may hold
filesystems open. Currently we can have different namespaces. All of
them are attached to processes and can be removed with kill. Now this
code adds another copy to an (automatically created) cgroup.

IMHO, the cgroup should be destructed automatically if the nsproxy is
about to be die.

Bastian

-- 
Deflector shields just came on, Captain.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] namespaces: introduce sys_hijack (v11)
       [not found]                 ` <20080801163905.GA4647-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-08-01 17:19                   ` Bastian Blank
@ 2008-08-01 17:22                   ` Bastian Blank
  1 sibling, 0 replies; 12+ messages in thread
From: Bastian Blank @ 2008-08-01 17:22 UTC (permalink / raw)
  To: Serge E. Hallyn, PavelEmelyanov; +Cc: Linux Containers

Resend: Removed some recipients by accident.

On Fri, Aug 01, 2008 at 11:39:05AM -0500, Serge E. Hallyn wrote:
> Quoting Bastian Blank (bastian-yyjItF7Rl6lg9hUCZPvPmw@public.gmane.org):
> > Why is it not enough to use the pid of the ns creator? The ns cgroups
> 
> pids wrap around

Ups, yes.

> > But I think I have a different problem. Currently, namespaces are
> > destructed if the last process using them exits. You change that, they
> > will survive until the cgroup dies. Or is that cgroup destructed when
> > there are no longer processes using the nsproxy? As the commit message
> > speaks about "pid wraparound" as problem, I doubt that.
> 
> Correct.  Having the namespaces stick around, and being able to attach
> to an empty container, was something Paul Menage had wanted IIRC.

It may produce problems with pid namespaces. The namespace is cleared if
the child reaper dies and I'm not sure how well it behaves without a new
one, which you can't create.

> But I'll leave that as is for now, until I hear something other than
> "this is so wrong it isn't funny" from Pavel :)

I'm not sure if it is funny to add another piece which may hold
filesystems open. Currently we can have different namespaces. All of
them are attached to processes and can be removed with kill. Now this
code adds another copy to an (automatically created) cgroup.

IMHO, the cgroup should be destructed automatically if the nsproxy is
about to be die.

Bastian

-- 
Deflector shields just came on, Captain.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] namespaces: introduce sys_hijack (v11)
       [not found]                     ` <20080801171951.GA23754-0IJIQSrh9RL9UF0aPl6fsj8Kkb2uy4ct@public.gmane.org>
@ 2008-08-01 17:38                       ` Serge E. Hallyn
       [not found]                         ` <20080801173817.GA21367-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Serge E. Hallyn @ 2008-08-01 17:38 UTC (permalink / raw)
  To: Bastian Blank; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Bastian Blank (bastian-yyjItF7Rl6lg9hUCZPvPmw@public.gmane.org):
> On Fri, Aug 01, 2008 at 11:39:05AM -0500, Serge E. Hallyn wrote:
> > Quoting Bastian Blank (bastian-yyjItF7Rl6lg9hUCZPvPmw@public.gmane.org):
> > > Why is it not enough to use the pid of the ns creator? The ns cgroups
> > 
> > pids wrap around
> 
> Ups, yes.
> 
> > > But I think I have a different problem. Currently, namespaces are
> > > destructed if the last process using them exits. You change that, they
> > > will survive until the cgroup dies. Or is that cgroup destructed when
> > > there are no longer processes using the nsproxy? As the commit message
> > > speaks about "pid wraparound" as problem, I doubt that.
> > 
> > Correct.  Having the namespaces stick around, and being able to attach
> > to an empty container, was something Paul Menage had wanted IIRC.
> 
> It may produce problems with pid namespaces. The namespace is cleared if
> the child reaper dies and I'm not sure how well it behaves without a new
> one, which you can't create.
> 
> > But I'll leave that as is for now, until I hear something other than
> > "this is so wrong it isn't funny" from Pavel :)
> 
> I'm not sure if it is funny to add another piece which may hold
> filesystems open. Currently we can have different namespaces. All of
> them are attached to processes and can be removed with kill. Now this
> code adds another copy to an (automatically created) cgroup.
> 
> IMHO, the cgroup should be destructed automatically if the nsproxy is
> about to be die.

I certainly don't think your caution is unwarranted.  I like to keep the
refcounting in all of this as simple as possible.

However, it does place you at odds with one of the few people who has
expressed an existing need for this functionality :)

Paul, could you respond with precisely what your needs are?  I.e. do
you really only care about the network namespace sticking around?  Need
all of it?  Not need this at all anymore?

thanks,
-serge

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] namespaces: introduce sys_hijack (v11)
       [not found]     ` <20080801172811.FEC3.KOSAKI.MOTOHIRO-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  2008-08-01 14:22       ` Serge E. Hallyn
@ 2008-08-07 19:23       ` Serge E. Hallyn
  1 sibling, 0 replies; 12+ messages in thread
From: Serge E. Hallyn @ 2008-08-07 19:23 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Linux Containers, Pavel Emelyanov

[-- Attachment #1: Type: text/plain, Size: 1516 bytes --]

Quoting KOSAKI Motohiro (kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org):
> Hi
> 
> fork() is very important for system performance.
> I worry about performance regression if this feature isn't used.
> 
> Could you mesure spawn benchmark in unixbench?

Sigh, well I did run some benchmarks on a machine on our grid, but none
seem to be giving useful or consistent results.  Unixbench results are
attached in the following files:

unixbench results:
	elm3b165.nohijack.1: result of './Run 1' on plain 2.6.27-rc1
	elm3b165.nohijack.4: result of './Run 4' on plain 2.6.27-rc1
	elm3b165.withhijack.1: result of './Run 1' on 2.6.27-rc1 with hijack patch
	elm3b165.withhijack.4: result of './Run 4' on 2.6.27-rc1 with hijack patch

I suppose I could have automated these to get more runs, but as it is
the results seem meaningless to me.

I also ran dbench and tbench.  All tbench instances were with
5 clients, all dbench with 4 clients.  I ran both dbench and tbench 10
times on each kernel, and here are the mean throughput +/- 95% confidence
interval:

Plain 2.6.27-rc1 kernel:
tbench: 520.355900 +/- 17.104307
dbench: 671.337900 +/- 146.785891

2.6.27-rc1 with hijack:
tbench: 523.213300 +/- 19.726943
dbench: 437.324400 +/- 169.086289

The tbench ones are solidly within the confidence bounds.  The dbench
results aren't really believable.  Cron wasn't running on the machine,
and nothing else *should* have been, but alas I don't own the machine
so don't know what could have been running.

thanks,
-serge

[-- Attachment #2: elm3b165.nohijack.1 --]
[-- Type: text/plain, Size: 2196 bytes --]

                     INDEX VALUES            
TEST                                        BASELINE     RESULT      INDEX

Dhrystone 2 using register variables        376783.7  8167048.5      216.8
Double-Precision Whetstone                      83.1     1199.0      144.3
Execl Throughput                               188.3     6777.2      359.9
File Copy 1024 bufsize 2000 maxblocks         2672.0   113916.0      426.3
File Copy 256 bufsize 500 maxblocks           1077.0    36714.0      340.9
File Read 4096 bufsize 8000 maxblocks        15382.0   718796.0      467.3
Pipe Throughput                             111814.6  1425864.8      127.5
Pipe-based Context Switching                 15448.6   394297.5      255.2
Process Creation                               569.3    18977.8      333.4
Shell Scripts (8 concurrent)                    44.8     1217.8      271.8
System Call Overhead                        114433.5  2858840.5      249.8
                                                                 =========
     FINAL SCORE                                                     270.3

                     INDEX VALUES            
TEST                                        BASELINE     RESULT      INDEX

Dhrystone 2 using register variables        376783.7  8362581.0      221.9
Double-Precision Whetstone                      83.1     1203.5      144.8
Execl Throughput                               188.3     6715.0      356.6
File Copy 1024 bufsize 2000 maxblocks         2672.0   107020.0      400.5
File Copy 256 bufsize 500 maxblocks           1077.0    32403.0      300.9
File Read 4096 bufsize 8000 maxblocks        15382.0   678707.0      441.2
Pipe Throughput                             111814.6  1385493.0      123.9
Pipe-based Context Switching                 15448.6   372454.9      241.1
Process Creation                               569.3    18607.9      326.9
Shell Scripts (8 concurrent)                    44.8     1208.8      269.8
System Call Overhead                        114433.5  2691682.8      235.2
                                                                 =========
     FINAL SCORE                                                     260.7


[-- Attachment #3: elm3b165.nohijack.4 --]
[-- Type: text/plain, Size: 2195 bytes --]

                     INDEX VALUES            
TEST                                        BASELINE     RESULT      INDEX

Dhrystone 2 using register variables        376783.7  8232510.9      218.5
Double-Precision Whetstone                      83.1     1202.4      144.7
Execl Throughput                               188.3     6739.0      357.9
File Copy 1024 bufsize 2000 maxblocks         2672.0   104621.0      391.5
File Copy 256 bufsize 500 maxblocks           1077.0    33092.0      307.3
File Read 4096 bufsize 8000 maxblocks        15382.0   684132.0      444.8
Pipe Throughput                             111814.6  1344322.6      120.2
Pipe-based Context Switching                 15448.6   385326.1      249.4
Process Creation                               569.3    18701.0      328.5
Shell Scripts (8 concurrent)                    44.8     1211.8      270.5
System Call Overhead                        114433.5  2910270.2      254.3
                                                                 =========
     FINAL SCORE                                                     262.6

                     INDEX VALUES            
TEST                                        BASELINE     RESULT      INDEX

Dhrystone 2 using register variables        376783.7  8247240.2      218.9
Double-Precision Whetstone                      83.1     1198.4      144.2
Execl Throughput                               188.3     6540.9      347.4
File Copy 1024 bufsize 2000 maxblocks         2672.0    97453.0      364.7
File Copy 256 bufsize 500 maxblocks           1077.0    31134.0      289.1
File Read 4096 bufsize 8000 maxblocks        15382.0   781214.0      507.9
Pipe Throughput                             111814.6  1408780.0      126.0
Pipe-based Context Switching                 15448.6   351906.0      227.8
Process Creation                               569.3    18638.3      327.4
Shell Scripts (8 concurrent)                    44.8     1211.8      270.5
System Call Overhead                        114433.5  2737286.8      239.2
                                                                 =========
     FINAL SCORE                                                     259.3

[-- Attachment #4: elm3b165.withhijack.1 --]
[-- Type: text/plain, Size: 2196 bytes --]

                     INDEX VALUES            
TEST                                        BASELINE     RESULT      INDEX

Dhrystone 2 using register variables        376783.7  8899355.0      236.2
Double-Precision Whetstone                      83.1     1204.8      145.0
Execl Throughput                               188.3     6626.6      351.9
File Copy 1024 bufsize 2000 maxblocks         2672.0   109426.0      409.5
File Copy 256 bufsize 500 maxblocks           1077.0    34428.0      319.7
File Read 4096 bufsize 8000 maxblocks        15382.0   625814.0      406.8
Pipe Throughput                             111814.6  1296658.5      116.0
Pipe-based Context Switching                 15448.6   359970.6      233.0
Process Creation                               569.3    18747.0      329.3
Shell Scripts (8 concurrent)                    44.8     1210.8      270.3
System Call Overhead                        114433.5  2745197.3      239.9
                                                                 =========
     FINAL SCORE                                                     260.2

                     INDEX VALUES            
TEST                                        BASELINE     RESULT      INDEX

Dhrystone 2 using register variables        376783.7  8823465.9      234.2
Double-Precision Whetstone                      83.1     1205.6      145.1
Execl Throughput                               188.3     6603.0      350.7
File Copy 1024 bufsize 2000 maxblocks         2672.0   104783.0      392.2
File Copy 256 bufsize 500 maxblocks           1077.0    32153.0      298.5
File Read 4096 bufsize 8000 maxblocks        15382.0   859178.0      558.6
Pipe Throughput                             111814.6  1185154.3      106.0
Pipe-based Context Switching                 15448.6   344165.8      222.8
Process Creation                               569.3    18397.4      323.2
Shell Scripts (8 concurrent)                    44.8     1207.6      269.6
System Call Overhead                        114433.5  2654906.5      232.0
                                                                 =========
     FINAL SCORE                                                     260.3


[-- Attachment #5: elm3b165.withhijack.4 --]
[-- Type: text/plain, Size: 2195 bytes --]

                     INDEX VALUES            
TEST                                        BASELINE     RESULT      INDEX

Dhrystone 2 using register variables        376783.7  8844791.9      234.7
Double-Precision Whetstone                      83.1     1206.1      145.1
Execl Throughput                               188.3     6700.7      355.9
File Copy 1024 bufsize 2000 maxblocks         2672.0   102895.0      385.1
File Copy 256 bufsize 500 maxblocks           1077.0    30585.0      284.0
File Read 4096 bufsize 8000 maxblocks        15382.0   877998.0      570.8
Pipe Throughput                             111814.6  1146617.1      102.5
Pipe-based Context Switching                 15448.6   355382.2      230.0
Process Creation                               569.3    18548.7      325.8
Shell Scripts (8 concurrent)                    44.8     1208.8      269.8
System Call Overhead                        114433.5  2815676.2      246.1
                                                                 =========
     FINAL SCORE                                                     261.2

                     INDEX VALUES            
TEST                                        BASELINE     RESULT      INDEX

Dhrystone 2 using register variables        376783.7  8884179.7      235.8
Double-Precision Whetstone                      83.1     1202.7      144.7
Execl Throughput                               188.3     6543.5      347.5
File Copy 1024 bufsize 2000 maxblocks         2672.0   103051.0      385.7
File Copy 256 bufsize 500 maxblocks           1077.0    32475.0      301.5
File Read 4096 bufsize 8000 maxblocks        15382.0   668682.0      434.7
Pipe Throughput                             111814.6  1269063.8      113.5
Pipe-based Context Switching                 15448.6   357748.7      231.6
Process Creation                               569.3    18862.8      331.3
Shell Scripts (8 concurrent)                    44.8     1206.8      269.4
System Call Overhead                        114433.5  2754854.2      240.7
                                                                 =========
     FINAL SCORE                                                     258.1

[-- Attachment #6: Type: text/plain, Size: 206 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/1] namespaces: introduce sys_hijack (v11)
       [not found]                         ` <20080801173817.GA21367-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-08-12 17:06                           ` Serge E. Hallyn
  0 siblings, 0 replies; 12+ messages in thread
From: Serge E. Hallyn @ 2008-08-12 17:06 UTC (permalink / raw)
  To: Bastian Blank, Paul Menage, Pavel Emelyanov
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Serge E. Hallyn (serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> Quoting Bastian Blank (bastian-yyjItF7Rl6lg9hUCZPvPmw@public.gmane.org):
> > On Fri, Aug 01, 2008 at 11:39:05AM -0500, Serge E. Hallyn wrote:
> > > Quoting Bastian Blank (bastian-yyjItF7Rl6lg9hUCZPvPmw@public.gmane.org):
> > > > Why is it not enough to use the pid of the ns creator? The ns cgroups
> > > 
> > > pids wrap around
> > 
> > Ups, yes.
> > 
> > > > But I think I have a different problem. Currently, namespaces are
> > > > destructed if the last process using them exits. You change that, they
> > > > will survive until the cgroup dies. Or is that cgroup destructed when
> > > > there are no longer processes using the nsproxy? As the commit message
> > > > speaks about "pid wraparound" as problem, I doubt that.
> > > 
> > > Correct.  Having the namespaces stick around, and being able to attach
> > > to an empty container, was something Paul Menage had wanted IIRC.
> > 
> > It may produce problems with pid namespaces. The namespace is cleared if
> > the child reaper dies and I'm not sure how well it behaves without a new
> > one, which you can't create.
> > 
> > > But I'll leave that as is for now, until I hear something other than
> > > "this is so wrong it isn't funny" from Pavel :)
> > 
> > I'm not sure if it is funny to add another piece which may hold
> > filesystems open. Currently we can have different namespaces. All of
> > them are attached to processes and can be removed with kill. Now this
> > code adds another copy to an (automatically created) cgroup.
> > 
> > IMHO, the cgroup should be destructed automatically if the nsproxy is
> > about to be die.
> 
> I certainly don't think your caution is unwarranted.  I like to keep the
> refcounting in all of this as simple as possible.

And as always those calling for caution are vindicated.  It turns out I
was grabbing a double-refcount on the nsproxy when a ns_cgroup is cloned.

After fixing that, I get warnings about potential circular locking
involving cgroup_mutex and namespace_sem.  This is because cgroup_mutex
depends on namespace_sem, but now doing rmdir on a once-filled ns_cgroup
calls put_fs_struct(ns_cgroup->fs).

But again, this patch was resent to solicit comment on the general
approach.  So I will put this patch aside again, unless I hear:

1. From Pavel, that he actually would like to use this approach for
namespace entering.

2. From Paul, that he still has a need for entering empty cgroups.

Otherwise, there is still the point of view (held I believe by Eric)
that the right thing to do is provide the monitoring and control over
containers that we need through proper namespace semantics and exported
filesystems.

thanks,
-serge

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2008-08-12 17:06 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-07-31 18:32 [PATCH 1/1] namespaces: introduce sys_hijack (v11) Serge E. Hallyn
     [not found] ` <20080731183213.GA12033-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-01  8:30   ` KOSAKI Motohiro
     [not found]     ` <20080801172811.FEC3.KOSAKI.MOTOHIRO-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2008-08-01 14:22       ` Serge E. Hallyn
2008-08-07 19:23       ` Serge E. Hallyn
2008-08-01  9:23   ` Bastian Blank
     [not found]     ` <20080801092318.GA2002-0IJIQSrh9RL9UF0aPl6fsj8Kkb2uy4ct@public.gmane.org>
2008-08-01 14:11       ` Serge E. Hallyn
     [not found]         ` <20080801141152.GA11553-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-01 15:51           ` Bastian Blank
     [not found]             ` <20080801155148.GA16760-0IJIQSrh9RL9UF0aPl6fsj8Kkb2uy4ct@public.gmane.org>
2008-08-01 16:39               ` Serge E. Hallyn
     [not found]                 ` <20080801163905.GA4647-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-01 17:19                   ` Bastian Blank
     [not found]                     ` <20080801171951.GA23754-0IJIQSrh9RL9UF0aPl6fsj8Kkb2uy4ct@public.gmane.org>
2008-08-01 17:38                       ` Serge E. Hallyn
     [not found]                         ` <20080801173817.GA21367-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-12 17:06                           ` Serge E. Hallyn
2008-08-01 17:22                   ` Bastian Blank

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.