public inbox for linux-kernel@vger.kernel.org
* [RFC][v7][PATCH 0/9] Implement clone2() system call
@ 2009-09-24 16:55 Sukadev Bhattiprolu
  2009-09-24 17:00 ` [RFC][v7][PATCH 1/9]: Factor out code to allocate pidmap page Sukadev Bhattiprolu
                   ` (10 more replies)
  0 siblings, 11 replies; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-24 16:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev


=== NEW CLONE() SYSTEM CALL:

To support application checkpoint/restart, a task must have the same pid it
had when it was checkpointed.  When containers are nested, the tasks within
the containers exist in multiple pid namespaces and hence have multiple pids
to specify during restart.

This patchset implements a new system call, clone2(), that lets a process
specify the pids of the child process.

Patches 1 through 7 are helper patches, needed for choosing a pid for the
child process.

Patch 8 defines a prototype of the new system call. Patch 9 adds some
documentation on the new system call, some/all of which will eventually
go into a man page.

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann]
	  Rename clone_with_pids() to clone2(). Also group the arguments to
	  clone2() into a 'struct clone_struct' to work around the issue of
	  exceeding 6 arguments to the system call. Also define clone-flags
	  as u64 to allow additional clone-flags.

Changelog[v6]:
	- [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds]
	  Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is
	  constant across architectures (Patches 7, 8).
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check (Patches 7,8)
	- (Pavel Machek) New patch (Patch 9) to add some documentation.

Changelog[v5]:
	- Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch
	  into this set)
	- (Eric Biederman): Avoid the new function, set_pidmap() - added a
	  couple of checks on 'target_pid' in alloc_pidmap() itself.

=== IMPORTANT TODO:

The clone() system call has another limitation: all available bits in
clone-flags are in use, and any new clone-flag will need a variant of the
clone() system call.

It appears to make sense to try and extend this new system call to address
this limitation as well. The requirements of a new clone system call could
then be summarized as:

	- do everything clone() does today, and
	- give the application the ability to choose pids for the child process
	  in all ancestor pid namespaces, and
	- allow more clone_flags

Constraints:

	- system calls are restricted to 6 parameters and clone() already
	  takes 5 parameters, so any extension to the clone() interface would
	  require one or more copy_from_user() calls.  (Not sure if a
	  copy_from_user() of ~40 bytes would have a significant impact on
	  the performance of clone() on any architecture).

Based on these requirements and constraints, we explored a couple of system
call interfaces (in earlier versions of this patchset) and currently define
the system call as:

	struct clone_struct {
		u64 flags;
		u64 child_stack;
		u32 nr_pids;
		u32 parent_tid;
		u32 child_tid;
		u32 reserved1;
		u64 reserved2;
	};

	sys_clone2(struct clone_struct __user *cs, pid_t __user *pids)
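
The "~40 bytes" figure in the constraints follows from the layout above:
8 + 8 + 4*4 + 8 = 40, with no implicit padding. A minimal userspace sketch
to check this, assuming uint64_t/uint32_t as stand-ins for the kernel's
u64/u32 (clone_struct_copy_size() is a hypothetical helper, not part of
the patchset):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the proposed layout; fixed-width types stand in
 * for the kernel's u64/u32. */
struct clone_struct {
	uint64_t flags;
	uint64_t child_stack;
	uint32_t nr_pids;
	uint32_t parent_tid;
	uint32_t child_tid;
	uint32_t reserved1;
	uint64_t reserved2;
};

/* Every member sits on a multiple of its own size, so the compiler has
 * no reason to insert padding on 32-bit or 64-bit ABIs. */
static_assert(sizeof(struct clone_struct) == 40,
	      "clone_struct should be 40 bytes");
static_assert(offsetof(struct clone_struct, nr_pids) == 16,
	      "nr_pids should follow the two 64-bit fields");
static_assert(offsetof(struct clone_struct, reserved2) == 32,
	      "reserved2 should follow the four 32-bit fields");

/* Hypothetical helper: number of bytes a single copy_from_user() of the
 * structure would need to transfer. */
static size_t clone_struct_copy_size(void)
{
	return sizeof(struct clone_struct);
}
```

Keeping the four 32-bit fields together is what lets the size stay the
same across architectures.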

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>


* [RFC][v7][PATCH 1/9]: Factor out code to allocate pidmap page
  2009-09-24 16:55 [RFC][v7][PATCH 0/9] Implement clone2() system call Sukadev Bhattiprolu
@ 2009-09-24 17:00 ` Sukadev Bhattiprolu
  2009-09-24 17:00 ` [RFC][v7][PATCH 2/9]: Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-24 17:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev



Subject: [RFC][v7][PATCH 1/9]: Factor out code to allocate pidmap page

To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.

Changelog[v3]:
	- An earlier version of the patchset called alloc_pidmap_page() from
	  two places, but now it's called from only one place. Even so, moving
	  this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
	- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
	  -ENOMEM on error instead of -1.
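
The helper follows an "allocate outside the lock, install under the lock,
free if someone else won" pattern. A userspace sketch of that pattern,
assuming a pthread mutex in place of pidmap_lock and calloc() in place of
kzalloc(); every name ending in _sketch is hypothetical:

```c
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define PAGE_SIZE_SKETCH 4096

struct pidmap_sketch {
	_Atomic(void *) page;	/* NULL until a page has been installed */
};

static pthread_mutex_t pidmap_lock_sketch = PTHREAD_MUTEX_INITIALIZER;

static int alloc_pidmap_page_sketch(struct pidmap_sketch *map)
{
	void *page;

	/* Fast path: a page is already installed. */
	if (atomic_load(&map->page))
		return 0;

	/* Allocate before taking the lock, since allocation may block. */
	page = calloc(1, PAGE_SIZE_SKETCH);

	pthread_mutex_lock(&pidmap_lock_sketch);
	if (atomic_load(&map->page))
		free(page);	/* lost the race; free(NULL) is harmless */
	else
		atomic_store(&map->page, page);
	pthread_mutex_unlock(&pidmap_lock_sketch);

	/* Still NULL only if our allocation failed and nobody else
	 * installed a page in the meantime. */
	return atomic_load(&map->page) ? 0 : -ENOMEM;
}
```

A second call on the same map takes the fast path and leaves the
installed page untouched.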

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/pid.c |   45 ++++++++++++++++++++++++++++++---------------
 1 file changed, 30 insertions(+), 15 deletions(-)

Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:06:25.000000000 -0700
@@ -122,9 +122,35 @@ static void free_pidmap(struct upid *upi
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (map->page)
+		kfree(page);
+	else
+		map->page = page;
+	spin_unlock_irq(&pidmap_lock);
+
+	if (unlikely(!map->page))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int rc;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -134,21 +160,10 @@ static int alloc_pidmap(struct pid_names
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		rc = alloc_pidmap_page(map);
+		if (rc)
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {


* [RFC][v7][PATCH 2/9]: Have alloc_pidmap() return actual error code
  2009-09-24 16:55 [RFC][v7][PATCH 0/9] Implement clone2() system call Sukadev Bhattiprolu
  2009-09-24 17:00 ` [RFC][v7][PATCH 1/9]: Factor out code to allocate pidmap page Sukadev Bhattiprolu
@ 2009-09-24 17:00 ` Sukadev Bhattiprolu
  2009-09-24 17:01 ` [RFC][v7][PATCH 3/9] Make pid_max a pid_ns property Sukadev Bhattiprolu
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-24 17:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev



Subject: [RFC][v7][PATCH 2/9]: Have alloc_pidmap() return actual error code

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed.  With support for setting a specific
pid number, alloc_pidmap() would also fail if the given pid number is
either invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.
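
For readers unfamiliar with the convention used in the diff below, here is
a userspace sketch of how a negative errno can travel inside a pointer
return value. It imitates the kernel's ERR_PTR()/IS_ERR()/PTR_ERR()
helpers; the *_sketch names and the 4095 limit are assumptions made for
this example only:

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno in the top MAX_ERRNO_SKETCH addresses, which
 * no valid allocation is expected to occupy. */
static inline void *err_ptr_sketch(long error)
{
	return (void *)(intptr_t)error;
}

static inline long ptr_err_sketch(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int is_err_sketch(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

struct pid_sketch {
	int nr;
};

/* A negative 'nr' simulates a failure reported by the pid-number
 * allocator; that error is passed through instead of being collapsed
 * into NULL. */
static struct pid_sketch *alloc_pid_sketch(int nr)
{
	struct pid_sketch *pid;

	if (nr < 0)
		return err_ptr_sketch(nr);

	pid = malloc(sizeof(*pid));
	if (!pid)
		return err_ptr_sketch(-ENOMEM);

	pid->nr = nr;
	return pid;
}
```

The caller tests the result with is_err_sketch() and recovers the errno
with ptr_err_sketch(), so -EAGAIN and -ENOMEM stay distinguishable.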

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |   14 +++++++++-----
 2 files changed, 12 insertions(+), 7 deletions(-)

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-09 19:06:46.000000000 -0700
@@ -1110,10 +1110,11 @@ static struct task_struct *copy_process(
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 
 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:06:25.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:06:46.000000000 -0700
@@ -150,7 +150,7 @@ static int alloc_pidmap_page(struct pidm
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
-	int rc;
+	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -189,12 +189,14 @@ static int alloc_pidmap(struct pid_names
 		} else {
 			map = &pid_ns->pidmap[0];
 			offset = RESERVED_PIDS;
-			if (unlikely(last == offset))
+			if (unlikely(last == offset)) {
+				rc = -EAGAIN;
 				break;
+			}
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return rc;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -263,8 +265,10 @@ struct pid *alloc_pid(struct pid_namespa
 	struct upid *upid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
-	if (!pid)
+	if (!pid) {
+		pid = ERR_PTR(-ENOMEM);
 		goto out;
+	}
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
@@ -299,7 +303,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }
 


* [RFC][v7][PATCH 3/9] Make pid_max a pid_ns property
  2009-09-24 16:55 [RFC][v7][PATCH 0/9] Implement clone2() system call Sukadev Bhattiprolu
  2009-09-24 17:00 ` [RFC][v7][PATCH 1/9]: Factor out code to allocate pidmap page Sukadev Bhattiprolu
  2009-09-24 17:00 ` [RFC][v7][PATCH 2/9]: Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
@ 2009-09-24 17:01 ` Sukadev Bhattiprolu
  2009-09-24 17:45   ` Oren Laadan
  2009-09-24 17:01 ` [RFC][v7][PATCH 4/9]: Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-24 17:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev



From: Serge Hallyn <serue@us.ibm.com>
Subject: [RFC][v7][PATCH 3/9] Make pid_max a pid_ns property

Remove the pid_max global, and make it a property of the
pid_namespace.  When a pid_ns is created, it inherits
the parent's pid_max.

Fixing up sysctl (trivial, akin to the ipc version, but
potentially tedious to get right for all CONFIG*
combinations) is left for later.

Changelog[v2]:
	- Port to newer kernel
	- Make pid_max a local variable in alloc_pidmap() to simplify code/patch

Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
---
 include/linux/pid_namespace.h |    1 +
 kernel/pid.c                  |    4 ++--
 kernel/pid_namespace.c        |    1 +
 kernel/sysctl.c               |    4 ++--
 4 files changed, 6 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/pid_namespace.h
===================================================================
--- linux-2.6.orig/include/linux/pid_namespace.h	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/include/linux/pid_namespace.h	2009-09-09 19:07:20.000000000 -0700
@@ -30,6 +30,7 @@ struct pid_namespace {
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct bsd_acct_struct *bacct;
 #endif
+	int pid_max;
 };
 
 extern struct pid_namespace init_pid_ns;
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:06:46.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-09 19:07:20.000000000 -0700
@@ -43,8 +43,6 @@ static struct hlist_head *pid_hash;
 static int pidhash_shift;
 struct pid init_struct_pid = INIT_STRUCT_PID;
 
-int pid_max = PID_MAX_DEFAULT;
-
 #define RESERVED_PIDS		300
 
 int pid_max_min = RESERVED_PIDS + 1;
@@ -78,6 +76,7 @@ struct pid_namespace init_pid_ns = {
 	.last_pid = 0,
 	.level = 0,
 	.child_reaper = &init_task,
+	.pid_max = PID_MAX_DEFAULT,
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
@@ -151,6 +150,7 @@ static int alloc_pidmap(struct pid_names
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	int rc = -EAGAIN;
+	int pid_max = pid_ns->pid_max;
 	struct pidmap *map;
 
 	pid = last + 1;
Index: linux-2.6/kernel/pid_namespace.c
===================================================================
--- linux-2.6.orig/kernel/pid_namespace.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/pid_namespace.c	2009-09-09 19:07:20.000000000 -0700
@@ -87,6 +87,7 @@ static struct pid_namespace *create_pid_
 
 	kref_init(&ns->kref);
 	ns->level = level;
+	ns->pid_max = parent_pid_ns->pid_max;
 	ns->parent = get_pid_ns(parent_pid_ns);
 
 	set_bit(0, ns->pidmap[0].page);
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2009-09-09 19:06:21.000000000 -0700
+++ linux-2.6/kernel/sysctl.c	2009-09-09 19:07:20.000000000 -0700
@@ -55,6 +55,7 @@
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
+#include <linux/pid_namespace.h>
 
 #ifdef CONFIG_X86
 #include <asm/nmi.h>
@@ -78,7 +79,6 @@ extern int max_threads;
 extern int core_uses_pid;
 extern int suid_dumpable;
 extern char core_pattern[];
-extern int pid_max;
 extern int min_free_kbytes;
 extern int pid_max_min, pid_max_max;
 extern int sysctl_drop_caches;
@@ -670,7 +670,7 @@ static struct ctl_table kern_table[] = {
 	{
 		.ctl_name	= KERN_PIDMAX,
 		.procname	= "pid_max",
-		.data		= &pid_max,
+		.data		= &init_pid_ns.pid_max,
 		.maxlen		= sizeof (int),
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec_minmax,


* [RFC][v7][PATCH 4/9]: Add target_pid parameter to alloc_pidmap()
  2009-09-24 16:55 [RFC][v7][PATCH 0/9] Implement clone2() system call Sukadev Bhattiprolu
                   ` (2 preceding siblings ...)
  2009-09-24 17:01 ` [RFC][v7][PATCH 3/9] Make pid_max a pid_ns property Sukadev Bhattiprolu
@ 2009-09-24 17:01 ` Sukadev Bhattiprolu
  2009-09-24 17:47   ` Oren Laadan
  2009-09-24 17:01 ` [RFC][v7][PATCH 5/9]: Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-24 17:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev



Subject: [RFC][v7][PATCH 4/9]: Add target_pid parameter to alloc_pidmap()

With support for setting a specific pid number for a process,
alloc_pidmap() will need a 'target_pid' parameter.

Changelog[v6]:
	- Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
	- (Eric Biederman): Avoid set_pidmap() function. Added a couple of
	  checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
	- (Serge Hallyn) Check for 'pid < 0' in set_pidmap(). (Code
	  actually checks for 'pid <= 0' for completeness).
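
The target_pid path reduces to a range check followed by a test-and-set of
one bit. A simplified userspace sketch, assuming a single small bitmap and
non-atomic bit operations (the kernel code uses the atomic
test_and_set_bit() on per-page maps); all *_SKETCH/_sketch names are
hypothetical:

```c
#include <errno.h>
#include <limits.h>

#define PID_MAX_SKETCH	1024
#define BITS_PER_WORD_SKETCH	(sizeof(unsigned long) * CHAR_BIT)

struct pid_bitmap_sketch {
	unsigned long bits[PID_MAX_SKETCH / BITS_PER_WORD_SKETCH];
};

/*
 * Claim exactly 'target_pid' or fail. Returns the pid on success,
 * -EINVAL if it is out of range, -EBUSY if it is already taken.
 * A target of 0 means "no specific pid requested" in the patch, so this
 * sketch treats it as invalid and leaves that case to the caller.
 */
static int claim_target_pid_sketch(struct pid_bitmap_sketch *map,
				   int target_pid)
{
	unsigned long *word;
	unsigned long mask;

	if (target_pid <= 0 || target_pid >= PID_MAX_SKETCH)
		return -EINVAL;

	word = &map->bits[target_pid / BITS_PER_WORD_SKETCH];
	mask = 1UL << (target_pid % BITS_PER_WORD_SKETCH);

	if (*word & mask)
		return -EBUSY;

	*word |= mask;
	return target_pid;
}
```

Unlike the scanning path, a busy target pid is never replaced with a
different free pid; the caller gets -EBUSY.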

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 kernel/pid.c |   26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-09 19:07:20.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-10 10:23:27.000000000 -0700
@@ -146,16 +146,22 @@ static int alloc_pidmap_page(struct pidm
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	int rc = -EAGAIN;
 	int pid_max = pid_ns->pid_max;
 	struct pidmap *map;
 
-	pid = last + 1;
-	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+	if (target_pid) {
+		pid = target_pid;
+		if (pid < 0 || pid >= pid_max)
+			return -EINVAL;
+	} else {
+		pid = last + 1;
+		if (pid >= pid_max)
+			pid = RESERVED_PIDS;
+	}
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
@@ -164,6 +170,15 @@ static int alloc_pidmap(struct pid_names
 		if (rc)
 			break;
 
+		if (target_pid) {
+			rc = -EBUSY;
+			if (!test_and_set_bit(offset, map->page)) {
+				atomic_dec(&map->nr_free);
+				rc = pid;
+			}
+			return rc;
+		}
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
@@ -196,6 +211,7 @@ static int alloc_pidmap(struct pid_names
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
+
 	return rc;
 }
 
@@ -272,7 +288,7 @@ struct pid *alloc_pid(struct pid_namespa
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, 0);
 		if (nr < 0)
 			goto out_free;
 


* [RFC][v7][PATCH 5/9]: Add target_pids parameter to alloc_pid()
  2009-09-24 16:55 [RFC][v7][PATCH 0/9] Implement clone2() system call Sukadev Bhattiprolu
                   ` (3 preceding siblings ...)
  2009-09-24 17:01 ` [RFC][v7][PATCH 4/9]: Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu
@ 2009-09-24 17:01 ` Sukadev Bhattiprolu
  2009-09-24 17:02 ` [RFC][v7][PATCH 6/9]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-24 17:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev



Subject: [RFC][v7][PATCH 5/9]: Add target_pids parameter to alloc_pid()

This parameter is currently NULL, but will be used in a follow-on patch.
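
The indexing convention is that target_pids[i] applies to the namespace at
level i, and a NULL array (or a 0 entry) leaves the choice to the
allocator. A userspace sketch of the loop under those assumptions, with
alloc_nr_sketch() as a hypothetical stand-in for alloc_pidmap():

```c
#include <stddef.h>
#include <sys/types.h>

/* Stand-in allocator: honour a nonzero target, otherwise hand out the
 * next number from this level's counter. A negative target is returned
 * unchanged and is treated as an error by the caller. */
static int alloc_nr_sketch(int *last_pid, pid_t tpid)
{
	if (tpid)
		return tpid;
	return ++(*last_pid);
}

/*
 * Walk from the task's own namespace (level) up to the initial one
 * (level 0), as the loop in alloc_pid() does. last_pid[] holds one
 * counter per level and numbers[] receives one pid per level.
 */
static int alloc_numbers_sketch(int level, int last_pid[],
				const pid_t *target_pids, int numbers[])
{
	int i;

	for (i = level; i >= 0; i--) {
		pid_t tpid = target_pids ? target_pids[i] : 0;
		int nr = alloc_nr_sketch(&last_pid[i], tpid);

		if (nr < 0)
			return nr;
		numbers[i] = nr;
	}
	return 0;
}
```

For example, with target_pids = {0, 77, 5} and level 2, the two inner
namespaces get pids 77 and 5 while level 0 gets the next free number.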

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/pid.h |    2 +-
 kernel/fork.c       |    3 ++-
 kernel/pid.c        |    9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/pid.h
===================================================================
--- linux-2.6.orig/include/linux/pid.h	2009-09-10 10:19:20.000000000 -0700
+++ linux-2.6/include/linux/pid.h	2009-09-10 10:29:10.000000000 -0700
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-10 10:19:20.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-10 10:29:10.000000000 -0700
@@ -940,6 +940,7 @@ static struct task_struct *copy_process(
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1110,7 +1111,7 @@ static struct task_struct *copy_process(
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c	2009-09-10 10:23:27.000000000 -0700
+++ linux-2.6/kernel/pid.c	2009-09-10 10:29:10.000000000 -0700
@@ -272,13 +272,14 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	int tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid) {
@@ -288,7 +289,11 @@ struct pid *alloc_pid(struct pid_namespa
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp, 0);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = alloc_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;
 


* [RFC][v7][PATCH 6/9]: Add target_pids parameter to copy_process()
  2009-09-24 16:55 [RFC][v7][PATCH 0/9] Implement clone2() system call Sukadev Bhattiprolu
                   ` (4 preceding siblings ...)
  2009-09-24 17:01 ` [RFC][v7][PATCH 5/9]: Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
@ 2009-09-24 17:02 ` Sukadev Bhattiprolu
  2009-09-24 17:02 ` [RFC][v7][PATCH 7/9]: Define do_fork_with_pids() Sukadev Bhattiprolu
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-24 17:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev



Subject: [RFC][v7][PATCH 6/9]: Add target_pids parameter to copy_process()

Add a 'target_pids' parameter to copy_process().  The new parameter will be
used in a follow-on patch when clone2() (formerly clone_with_pids()) is
implemented.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-10 10:29:10.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-10 10:29:13.000000000 -0700
@@ -935,12 +935,12 @@ static struct task_struct *copy_process(
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1319,7 +1319,7 @@ struct task_struct * __cpuinit fork_idle
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1342,6 +1342,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1382,7 +1383,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.


* [RFC][v7][PATCH 7/9]: Define do_fork_with_pids()
  2009-09-24 16:55 [RFC][v7][PATCH 0/9] Implement clone2() system call Sukadev Bhattiprolu
                   ` (5 preceding siblings ...)
  2009-09-24 17:02 ` [RFC][v7][PATCH 6/9]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu
@ 2009-09-24 17:02 ` Sukadev Bhattiprolu
  2009-09-24 17:03 ` [RFC][v7][PATCH 8/9]: Define clone2() syscall Sukadev Bhattiprolu
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-24 17:02 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev



Subject: [RFC][v7][PATCH 7/9]: Define do_fork_with_pids()

do_fork_with_pids() is the same as do_fork(), except that it takes two
additional parameters, 'num_pids' and 'upids'. These parameters, currently
unused, specify the set of target pids of the process in each of its pid
namespaces.

Changelog[v7]:
	- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
	  instead of 'struct pid_set *'.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.

Changelog[v4]:
	- Rename 'struct target_pid_set' to 'struct pid_set' since it may
	  be useful in other contexts.

Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
	- To facilitate moving architecture-independent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/sched.h |    3 +++
 kernel/fork.c         |   17 +++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h	2009-09-11 18:44:04.000000000 -0700
+++ linux-2.6/include/linux/sched.h	2009-09-12 09:43:20.000000000 -0700
@@ -2054,6 +2054,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				unsigned int, pid_t __user *);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-11 20:09:15.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-12 11:17:21.000000000 -0700
@@ -1332,12 +1332,14 @@ struct task_struct * __cpuinit fork_idle
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      unsigned int num_pids,
+	      pid_t __user *upids)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1440,6 +1442,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, 0, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif


* [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-24 16:55 [RFC][v7][PATCH 0/9] Implement clone2() system call Sukadev Bhattiprolu
                   ` (6 preceding siblings ...)
  2009-09-24 17:02 ` [RFC][v7][PATCH 7/9]: Define do_fork_with_pids() Sukadev Bhattiprolu
@ 2009-09-24 17:03 ` Sukadev Bhattiprolu
  2009-09-24 21:43   ` Arnd Bergmann
  2009-09-24 17:03 ` [RFC][v7][PATCH 9/9]: Document " Sukadev Bhattiprolu
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-24 17:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev



Subject: [RFC][v7][PATCH 8/9]: Define clone2() syscall

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone2(), intended for use during restart, is the same as clone(), except
that it takes a 'pids' parameter. This parameter lets the caller choose
specific pid numbers for the child process, in the process's active and
ancestor pid namespaces. (Descendant pid namespaces in general don't matter
since processes don't have pids in them anyway, but see comments in
copy_target_pids() regarding CLONE_NEWPID).

The clone2() system call also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and most (all?)
of these are in use. If a new clone flag is needed, we will be forced to
define a new variant of the clone() system call.

To prevent unprivileged processes from misusing this interface, clone2()
currently needs CAP_SYS_ADMIN when the 'pids' parameter is non-NULL.

See Documentation/clone2 in next patch for more details of clone2() and an
example of its usage.

NOTE:
	System calls are restricted to 6 parameters and the number and sizes
	of parameters needed for sys_clone2() exceed 6 integers. The new
	prototype works around this restriction while providing some
	flexibility if clone2() needs to be further extended in the future.
TODO:
	- May need additional sanity checks in do_fork_with_pids().

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
	  and group parameters into a new 'struct clone_struct' object.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check.

Changelog[v4]:
	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 arch/x86/include/asm/syscalls.h    |    1 
 arch/x86/include/asm/unistd_32.h   |    1 
 arch/x86/kernel/process_32.c       |   31 +++++++++++
 arch/x86/kernel/syscall_table_32.S |    1 
 include/linux/types.h              |   10 +++
 kernel/fork.c                      |   96 ++++++++++++++++++++++++++++++++++++-
 6 files changed, 139 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/include/asm/syscalls.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/syscalls.h	2009-09-18 17:39:46.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/syscalls.h	2009-09-18 17:39:50.000000000 -0700
@@ -55,6 +55,7 @@ struct sel_arg_struct;
 struct oldold_utsname;
 struct old_utsname;
 
+asmlinkage long sys_clone2(struct clone_struct __user *cs, pid_t __user *pids);
 asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
 			  unsigned long, unsigned long, unsigned long);
 asmlinkage int old_mmap(struct mmap_arg_struct __user *);
Index: linux-2.6/arch/x86/include/asm/unistd_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/unistd_32.h	2009-09-18 17:39:46.000000000 -0700
+++ linux-2.6/arch/x86/include/asm/unistd_32.h	2009-09-18 17:39:50.000000000 -0700
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_counter_open	336
+#define __NR_clone2		337
 
 #ifdef __KERNEL__
 
Index: linux-2.6/arch/x86/kernel/process_32.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/process_32.c	2009-09-18 17:39:46.000000000 -0700
+++ linux-2.6/arch/x86/kernel/process_32.c	2009-09-18 17:40:05.000000000 -0700
@@ -443,6 +443,37 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+asmlinkage long
+sys_clone2(struct clone_struct __user *ucs, pid_t __user *pids)
+{
+	int rc;
+	struct clone_struct kcs;
+	unsigned long clone_flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long child_stack_base;
+	struct pt_regs *regs;
+
+	rc = copy_from_user(&kcs, ucs, sizeof(kcs));
+	if (rc)
+		return -EFAULT;
+
+	/*
+	 * TODO: Convert clone_flags to 64-bit
+	 */
+	clone_flags = (unsigned long)kcs.flags;
+	child_stack_base = (unsigned long)kcs.child_stack;
+	parent_tid_ptr = (int __user *)&ucs->parent_tid;
+	child_tid_ptr = (int __user *)&ucs->child_tid;
+	regs = task_pt_regs(current);
+
+	if (!child_stack_base)
+		child_stack_base = user_stack_pointer(regs);
+
+	return do_fork_with_pids(clone_flags, child_stack_base, regs, 0,
+			parent_tid_ptr, child_tid_ptr, kcs.nr_pids, pids);
+}
+
 /*
  * sys_execve() executes a new program.
  */
Index: linux-2.6/arch/x86/kernel/syscall_table_32.S
===================================================================
--- linux-2.6.orig/arch/x86/kernel/syscall_table_32.S	2009-09-18 17:39:46.000000000 -0700
+++ linux-2.6/arch/x86/kernel/syscall_table_32.S	2009-09-18 17:39:50.000000000 -0700
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_counter_open
+	.long sys_clone2
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c	2009-09-18 17:39:50.000000000 -0700
+++ linux-2.6/kernel/fork.c	2009-09-18 17:39:50.000000000 -0700
@@ -1327,6 +1327,86 @@ struct task_struct * __cpuinit fork_idle
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any error, return the error code encoded with ERR_PTR().
+ */
+static pid_t *copy_target_pids(unsigned int unum_pids, pid_t __user *upids)
+{
+	int j;
+	int rc;
+	int size;
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+
+	if (!unum_pids)
+		return NULL;
+
+	knum_pids = task_pid(current)->level + 1;
+	if (unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces is saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e. an 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested pids correspond to 'known namespaces' and since
+	 * 'known namespaces' are younger than (i.e. descendants of) 'unknown
+	 * namespaces', copy requested pids to the back end of target_pids[]
+	 * (i.e. before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], upids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1344,7 +1424,7 @@ long do_fork_with_pids(unsigned long clo
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1378,6 +1458,16 @@ long do_fork_with_pids(unsigned long clo
 		}
 	}
 
+	target_pids = copy_target_pids(num_pids, upids);
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1439,6 +1529,10 @@ long do_fork_with_pids(unsigned long clo
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }
 
Index: linux-2.6/include/linux/types.h
===================================================================
--- linux-2.6.orig/include/linux/types.h	2009-09-18 17:39:46.000000000 -0700
+++ linux-2.6/include/linux/types.h	2009-09-18 17:39:50.000000000 -0700
@@ -204,6 +204,16 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct clone_struct {
+	u64 flags;
+	u64 child_stack;
+	u32 nr_pids;
+	u32 parent_tid;
+	u32 child_tid;
+	u32 reserved1;
+	u64 reserved2;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
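
The placement of requested pids described in the copy_target_pids() comment
can be modelled in user space. This is only a sketch under stated
assumptions: layout_target_pids() is a hypothetical name, 'level' stands in
for task_pid(current)->level, and calloc()/memcpy() stand in for kzalloc()
and copy_from_user(). Unlike the kernel code, it reports every failure as
NULL instead of using ERR_PTR().

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/*
 * User-space model of the array built by copy_target_pids().
 * Returns a zero-filled array of (level + 2) entries with the requested
 * pids copied just before the final slot, or NULL if nothing was
 * requested, too many pids were requested, or allocation failed.
 * The caller frees the result.
 */
static pid_t *layout_target_pids(int level, unsigned int unum_pids,
				 const pid_t *upids)
{
	unsigned int knum_pids = level + 1;	/* one pid per namespace level */
	pid_t *target_pids;

	if (!unum_pids || unum_pids > knum_pids)
		return NULL;

	/* One extra zeroed slot at the end for a possible CLONE_NEWPID. */
	target_pids = calloc(knum_pids + 1, sizeof(pid_t));
	if (!target_pids)
		return NULL;

	/* Requested pids belong to the youngest namespaces: copy to the back. */
	memcpy(&target_pids[knum_pids - unum_pids], upids,
	       unum_pids * sizeof(pid_t));
	return target_pids;
}
```

For a task at level 3 that requests { 77, 99 }, the model produces
{ 0, 0, 77, 99, 0 }: two kernel-assigned pids for the 'unknown namespaces',
the two requested pids, and the trailing 0 reserved for CLONE_NEWPID.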


* [RFC][v7][PATCH 9/9]: Document clone2() syscall
  2009-09-24 16:55 [RFC][v7][PATCH 0/9] Implement clone2() system call Sukadev Bhattiprolu
                   ` (7 preceding siblings ...)
  2009-09-24 17:03 ` [RFC][v7][PATCH 8/9]: Define clone2() syscall Sukadev Bhattiprolu
@ 2009-09-24 17:03 ` Sukadev Bhattiprolu
  2009-09-24 18:05   ` Randy Dunlap
  2009-09-25  2:31   ` KOSAKI Motohiro
  2009-09-24 17:44 ` [RFC][v7][PATCH 0/9] Implement clone2() system call Oren Laadan
  2009-09-24 17:57 ` Alexey Dobriyan
  10 siblings, 2 replies; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-24 17:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev



Subject: [RFC][v7][PATCH 9/9]: Document clone2() syscall

This gives a brief overview of the clone2() system call.  We should
eventually describe more details in existing clone(2) man page or in
a new man page.

Changelog[v7]:
	- Rename clone_with_pids() to clone2()
	- Changes to reflect new prototype of clone2() (using clone_struct).

Signed-off-by: Sukadev Bhattiprolu <sukadev@vnet.linux.ibm.com>
---
 Documentation/clone2 |   85 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

Index: linux-2.6/Documentation/clone2
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/clone2	2009-09-18 18:48:00.000000000 -0700
@@ -0,0 +1,85 @@
+
+struct clone_struct {
+	u64 flags;
+	u64 child_stack;
+	u32 nr_pids;
+	u32 parent_tid;
+	u32 child_tid;
+	u32 reserved1;
+	u64 reserved2;
+};
+
+clone2(struct clone_struct __user *clone_args, pid_t __user *pids)
+
+	In addition to doing everything that clone() system call does,
+	the clone2() system call:
+
+		- allows additional clone flags (all 32 bits in the flags
+		  parameter to clone() are in use)
+
+		- allows user to specify a pid for the child process in its
+		  active and ancestor pid name spaces.
+
+	This system call is meant to be used when restarting an application
+	from a checkpoint.  Such restart requires that the processes in the
+	application have the same pids they had when the application was
+	checkpointed. When containers are nested, the processes within the
+	containers exist in multiple pid namespaces and hence have multiple
+	pids to specify during restart.
+
+	The @pids defines the set of pids that should be assigned to the child
+	process in its active and ancestor pid name spaces. The descendant pid
+	namespaces do not matter since a process does not have a pid in
+	descendant namespaces, unless the process is in a new pid namespace
+	in which case the process is a container-init (and must have the pid 1
+	in that namespace).
+
+	See CLONE_NEWPID section of clone(2) man page for details about pid
+	namespaces.
+
+	The order of pids in @pids corresponds to the nesting order of pid-
+	namespaces, with @pids[0] corresponding to the init_pid_ns.
+
+	If a pid in the @pids list is 0, the kernel will assign the next
+	available pid in the pid namespace, for the process.
+
+	If a pid in the @pids list is non-zero, the kernel tries to assign
+	the specified pid in that namespace.  If that pid is already in use
+	by another process, the system call fails with -EBUSY.
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, clone2() returns -1 and sets 'errno' to one of the following
+	values (the child process is not created).
+
+	EPERM	Caller does not have the CAP_SYS_ADMIN capability needed to
+		execute this call.
+
+	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
+		the current nesting level of the parent process.
+
+	EBUSY	A requested pid is in use by another process in that name space.
+
+Example:
+
+	pid_t pids[] = { 77, 99 };
+	struct clone_struct cs;
+
+	cs.flags = (u64) SIGCHLD;
+	cs.child_stack = (u64) setup_child_stack();
+	cs.nr_pids = 2;
+	cs.parent_tid = 0;
+	cs.child_tid = 0;
+
+	rc = syscall(__NR_clone2, &cs, pids);
+
+	if (rc < 0) {
+		perror("clone2()");
+		exit(1);
+	} else if (rc) {
+		/* Parent */
+	} else {
+		/* Child */
+	}
+
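
One property of the documented structure worth checking is that it has no
implicit padding, so its layout should be the same for 32-bit and 64-bit
callers. A small sketch of such a check, assuming a user-space mirror of the
structure built from <stdint.h> types (the kernel patch itself uses u64/u32):

```c
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the proposed structure, with fixed-width types. */
struct clone_struct {
	uint64_t flags;
	uint64_t child_stack;
	uint32_t nr_pids;
	uint32_t parent_tid;
	uint32_t child_tid;
	uint32_t reserved1;
	uint64_t reserved2;
};

/* Two u64, four u32 and one more u64 add up to 40 bytes with no holes. */
_Static_assert(sizeof(struct clone_struct) == 40, "unexpected size");
_Static_assert(offsetof(struct clone_struct, nr_pids) == 16,
	       "unexpected offset of nr_pids");
_Static_assert(offsetof(struct clone_struct, reserved2) == 32,
	       "unexpected offset of reserved2");
```

The four u32 fields fill exactly 16 bytes, which keeps 'reserved2' naturally
aligned without the compiler having to insert padding.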


* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-09-24 16:55 [RFC][v7][PATCH 0/9] Implement clone2() system call Sukadev Bhattiprolu
                   ` (8 preceding siblings ...)
  2009-09-24 17:03 ` [RFC][v7][PATCH 9/9]: Document " Sukadev Bhattiprolu
@ 2009-09-24 17:44 ` Oren Laadan
  2009-09-24 20:15   ` Sukadev Bhattiprolu
  2009-10-01  2:36   ` Sukadev Bhattiprolu
  2009-09-24 17:57 ` Alexey Dobriyan
  10 siblings, 2 replies; 45+ messages in thread
From: Oren Laadan @ 2009-09-24 17:44 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, arnd, Containers, Nathan Lynch, Eric W. Biederman,
	hpa, mingo, torvalds, Alexey Dobriyan, Pavel Emelyanov



Sukadev Bhattiprolu wrote:
> === NEW CLONE() SYSTEM CALL:
> 
> To support application checkpoint/restart, a task must have the same pid it
> had when it was checkpointed.  When containers are nested, the tasks within
> the containers exist in multiple pid namespaces and hence have multiple pids
> to specify during restart.
> 
> This patchset implements a new system call, clone2() that lets a process
> specify the pids of the child process.
> 
> Patches 1 through 6 are helper patches, needed for choosing a pid for the
> child process.
> 
> Patch 8 defines a prototype of the new system call. Patch 9 adds some
> documentation on the new system call, some/all of which will eventually
> go into a man page.
> 

[...]

> 
> Based on these requirements and constraints, we explored a couple of system
> call interfaces (in earlier versions of this patchset) and currently define
> the system call as:
> 
> 	struct clone_struct {
> 		u64 flags;
> 		u64 child_stack;
> 		u32 nr_pids;
> 		u32 parent_tid;
> 		u32 child_tid;

So @parent_tid and @child_tid are pointers to userspace memory and
require 'u64' (and it won't hurt to make @reserved1 a 'u64' as well).

> 		u32 reserved1;
> 		u64 reserved2;
> 	};
> 

Also, for forward/backward compatibility, explicitly state in the
documentation, and enforce in the kernel, that flags which are not
defined must not be set, and that reserved{1,2} must remain 0.
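
A check along these lines could be sketched as follows. This is only an
illustration, not code from the patchset: check_clone_struct() and
CLONE2_VALID_FLAGS are hypothetical names, and the mask assumes that only
the low 32 flag bits are defined so far.

```c
#include <errno.h>
#include <stdint.h>

/* User-space mirror of the proposed structure. */
struct clone_struct {
	uint64_t flags;
	uint64_t child_stack;
	uint32_t nr_pids;
	uint32_t parent_tid;
	uint32_t child_tid;
	uint32_t reserved1;
	uint64_t reserved2;
};

/* Hypothetical mask of defined flag bits: assume only the low 32 for now. */
#define CLONE2_VALID_FLAGS 0xffffffffULL

/*
 * Reject undefined flag bits and non-zero reserved fields so that they
 * can be given a meaning later without breaking existing callers.
 */
static int check_clone_struct(const struct clone_struct *kcs)
{
	if (kcs->flags & ~CLONE2_VALID_FLAGS)
		return -EINVAL;
	if (kcs->reserved1 || kcs->reserved2)
		return -EINVAL;
	return 0;
}
```

Rejecting such values with -EINVAL from the first release is what lets a
later kernel reuse them safely.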

> 	sys_clone2(struct clone_struct __user *cs, pid_t __user *pids)
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

Otherwise, looks great.

Oren.



* Re: [RFC][v7][PATCH 3/9] Make pid_max a pid_ns property
  2009-09-24 17:01 ` [RFC][v7][PATCH 3/9] Make pid_max a pid_ns property Sukadev Bhattiprolu
@ 2009-09-24 17:45   ` Oren Laadan
  0 siblings, 0 replies; 45+ messages in thread
From: Oren Laadan @ 2009-09-24 17:45 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, arnd, Containers, Nathan Lynch, Eric W. Biederman,
	hpa, mingo, torvalds, Alexey Dobriyan, Pavel Emelyanov



Sukadev Bhattiprolu wrote:
> 
> From: Serge Hallyn <serue@us.ibm.com>
> Subject: [RFC][v7][PATCH 3/9] Make pid_max a pid_ns property
> 
> Remove the pid_max global, and make it a property of the
> pid_namespace.  When a pid_ns is created, it inherits
> the parent's pid_ns.
> 
> Fixing up sysctl (trivial akin to ipc version, but
> potentially tedious to get right for all CONFIG*
> combinations) is left for later.
> 
> Changelog[v2]:
> 	- Port to newer kernel
> 	- Make pid_max a local variable in alloc_pidmap() to simplify code/patch
> 
> Signed-off-by: Serge Hallyn <serue@us.ibm.com>
> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>

Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>




* Re: [RFC][v7][PATCH 4/9]: Add target_pid parameter to alloc_pidmap()
  2009-09-24 17:01 ` [RFC][v7][PATCH 4/9]: Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu
@ 2009-09-24 17:47   ` Oren Laadan
  0 siblings, 0 replies; 45+ messages in thread
From: Oren Laadan @ 2009-09-24 17:47 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, arnd, Containers, Nathan Lynch, Eric W. Biederman,
	hpa, mingo, torvalds, Alexey Dobriyan, Pavel Emelyanov



Sukadev Bhattiprolu wrote:
> 
> Subject: [RFC][v7][PATCH 4/9]: Add target_pid parameter to alloc_pidmap()
> 
> With support for setting a specific pid number for a process,
> alloc_pidmap() will need a 'target_pid' parameter.
> 
> Changelog[v6]:
> 	- Separate target_pid > 0 case to minimize the number of checks needed.
> Changelog[v3]:
> 	- (Eric Biederman): Avoid set_pidmap() function. Added couple of
> 	  checks for target_pid in alloc_pidmap() itself.
> Changelog[v2]:
> 	- (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
> 	  actually checks for 'pid <= 0' for completeness).
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> Acked-by: Serge Hallyn <serue@us.ibm.com>

Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>



* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-09-24 16:55 [RFC][v7][PATCH 0/9] Implement clone2() system call Sukadev Bhattiprolu
                   ` (9 preceding siblings ...)
  2009-09-24 17:44 ` [RFC][v7][PATCH 0/9] Implement clone2() system call Oren Laadan
@ 2009-09-24 17:57 ` Alexey Dobriyan
  2009-09-24 18:35   ` Serge E. Hallyn
  10 siblings, 1 reply; 45+ messages in thread
From: Alexey Dobriyan @ 2009-09-24 17:57 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev

I don't like this even more.

Pid namespaces are hierarchical _and_ anonymous, so simply a
set of numbers doesn't describe the final object.

struct pid isn't special, it's just another invariant if you like
as far as C/R is concerned, but system call is made special wrt pids.

What will be in an image? I hope "struct kstate_image_pid" with several
numbers and with references to such object from other places, so it
seems natural to do alloc_pid() with needed numbers and then attach the new
shiny pid where needed. But this clone_pid is only for task_struct's pids.


* Re: [RFC][v7][PATCH 9/9]: Document clone2() syscall
  2009-09-24 17:03 ` [RFC][v7][PATCH 9/9]: Document " Sukadev Bhattiprolu
@ 2009-09-24 18:05   ` Randy Dunlap
  2009-09-25  2:31   ` KOSAKI Motohiro
  1 sibling, 0 replies; 45+ messages in thread
From: Randy Dunlap @ 2009-09-24 18:05 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, arnd, peterz, Containers, sukadev

Sukadev Bhattiprolu wrote:
> 
> Subject: [RFC][v7][PATCH 9/9]: Document clone2() syscall
> 
> This gives a brief overview of the clone2() system call.  We should
> eventually describe more details in existing clone(2) man page or in
> a new man page.

Hi,

We have a separate mailing list (linux-api@vger.kernel.org)
where new kernel APIs are (or were?) meant to be discussed/checked/tested.

Maybe Michael Kerrisk would care (or would have cared?) about this.

I don't see linux-api@vger.kernel.org listed in MAINTAINERS,
but it is referred to in Documentation/HOWTO and Documentation/SubmitChecklist.
Does it need to be listed in MAINTAINERS?
(oh, you didn't read Documentation/SubmitChecklist ??)

Anyway, please cc: linux-api@vger.kernel.org on future patches like this
series.


> Changelog[v7]:
> 	- Rename clone_with_pids() to clone2()
> 	- Changes to reflect new prototype of clone2() (using clone_struct).
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev@vnet.linux.ibm.com>
> ---
>  Documentation/clone2 |   85 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 85 insertions(+)
> 
> Index: linux-2.6/Documentation/clone2
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/Documentation/clone2	2009-09-18 18:48:00.000000000 -0700
> @@ -0,0 +1,85 @@
> +
> +struct clone_struct {
> +	u64 flags;
> +	u64 child_stack;
> +	u32 nr_pids;
> +	u32 parent_tid;
> +	u32 child_tid;
> +	u32 reserved1;
> +	u64 reserved2;
> +};
> +
> +clone2(struct clone_struct __user *clone_args, pid_t __user *pids)
> +
> +	In addition to doing everything that clone() system call does,
> +	the clone2() system call:
> +
> +		- allows additional clone flags (all 32 bits in the flags
> +		  parameter to clone() are in use)
> +
> +		- allows user to specify a pid for the child process in its
> +		  active and ancestor pid name spaces.
> +
> +	This system call is meant to be used when restarting an application
> +	from a checkpoint.  Such restart requires that the processes in the
> +	application have the same pids they had when the application was
> +	checkpointed. When containers are nested, the processes within the
> +	containers exist in multiple pid namespaces and hence have multiple
> +	pids to specify during restart.
> +
> +	The @pids defines the set of pids that should be assigned to the child
> +	process in its active and ancestor pid name spaces. The descendant pid
> +	namespaces do not matter since a process does not have a pid in
> +	descendant namespaces, unless the process is in a new pid namespace
> +	in which case the process is a container-init (and must have the pid 1
> +	in that namespace).
> +
> +	See CLONE_NEWPID section of clone(2) man page for details about pid
> +	namespaces.
> +
> +	The order of pids in @pids corresponds to the nesting order of pid-
> +	namespaces, with @pids[0] corresponding to the init_pid_ns.
> +
> +	If a pid in the @pids list is 0, the kernel will assign the next
> +	available pid in the pid namespace, for the process.
> +
> +	If a pid in the @pids list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails with -EBUSY.
> +
> +	On success, the system call returns the pid of the child process in
> +	the parent's active pid namespace.
> +
> +	On failure, clone2() returns -1 and sets 'errno' to one of the following
> +	values (the child process is not created).
> +
> +	EPERM	Caller does not have the CAP_SYS_ADMIN capability needed to
> +		execute this call.
> +
> +	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
> +		the current nesting level of the parent process.
> +
> +	EBUSY	A requested pid is in use by another process in that name space.
> +
> +Example:
> +
> +	pid_t pids[] = { 77, 99 };
> +	struct clone_struct cs;
> +
> +	cs.flags = (u64) SIGCHLD;
> +	cs.child_stack = (u64) setup_child_stack();
> +	cs.nr_pids = 2;
> +	cs.parent_tid = 0;
> +	cs.child_tid = 0;
> +
> +	rc = syscall(__NR_clone2, &cs, pids);
> +
> +	if (rc < 0) {
> +		perror("clone2()");
> +		exit(1);
> +	} else if (rc) {
> +		/* Parent */
> +	} else {
> +		/* Child */
> +	}
> +



* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-09-24 17:57 ` Alexey Dobriyan
@ 2009-09-24 18:35   ` Serge E. Hallyn
  2009-09-30  5:34     ` Alexey Dobriyan
  0 siblings, 1 reply; 45+ messages in thread
From: Serge E. Hallyn @ 2009-09-24 18:35 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Sukadev Bhattiprolu, linux-kernel, Oren Laadan, Eric W. Biederman,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev

Quoting Alexey Dobriyan (adobriyan@gmail.com):
> I don't like this even more.
> 
> Pid namespaces are hierarchical _and_ anonymous, so simply a
> set of numbers doesn't describe the final object.
> 
> struct pid isn't special, it's just another invariant if you like
> as far as C/R is concerned, but system call is made special wrt pids.
> 
> What will be in an image? I hope "struct kstate_image_pid" with several

Sure pid namespaces are anonymous, but we will give each an 'objref'
valid only for a checkpoint image, and store the relationship between
pid namespaces based on those objrefs.  Basically the same way that user
structs and hierarchical user namespaces are handled right now.

> numbers and with references to such object from other places, so it
> seems natural to do alloc_pid() with needed numbers and then attach the new
> shiny pid where needed. But this clone_pid is only for task_struct's pids.


* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-09-24 17:44 ` [RFC][v7][PATCH 0/9] Implement clone2() system call Oren Laadan
@ 2009-09-24 20:15   ` Sukadev Bhattiprolu
  2009-09-24 22:06     ` Oren Laadan
  2009-10-01  2:36   ` Sukadev Bhattiprolu
  1 sibling, 1 reply; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-24 20:15 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-kernel, arnd, Containers, Nathan Lynch, Eric W. Biederman,
	hpa, mingo, torvalds, Alexey Dobriyan, Pavel Emelyanov

Oren Laadan [orenl@librato.com] wrote:
| 
| 
| Sukadev Bhattiprolu wrote:
| > Based on these requirements and constraints, we explored a couple of system
| > call interfaces (in earlier versions of this patchset) and currently define
| > the system call as:
| > 
| > 	struct clone_struct {
| > 		u64 flags;
| > 		u64 child_stack;
| > 		u32 nr_pids;
| > 		u32 parent_tid;
| > 		u32 child_tid;
| 
| So @parent_tid and @child_tid are pointers to userspace memory and
| require 'u64' (and it won't hurt to make @reserved1 a 'u64' as well).

No, as Arnd pointed out, we already pass in a pointer to 'struct clone_struct'
and the kernel can use that pointer to copy the parent and child tids.

| 
| > 		u32 reserved1;
| > 		u64 reserved2;
| > 	};
| > 
| 
| Also, for forward/backward compatibility, explicitly state in the
| documentation, and enforce in the kernel, that flags which are not
| defined must not be set, and that reserved{1,2} must remain 0.

Good idea. Will do.

Thanks,

Sukadev


* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-24 17:03 ` [RFC][v7][PATCH 8/9]: Define clone2() syscall Sukadev Bhattiprolu
@ 2009-09-24 21:43   ` Arnd Bergmann
  2009-09-25  8:23     ` Louis Rilling
  0 siblings, 1 reply; 45+ messages in thread
From: Arnd Bergmann @ 2009-09-24 21:43 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, Oren Laadan, serue, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton, torvalds, mikew,
	mingo, hpa, Nathan Lynch, peterz, Containers, sukadev

On Thursday 24 September 2009, Sukadev Bhattiprolu wrote:
> +asmlinkage long
> +sys_clone2(struct clone_struct __user *ucs, pid_t __user *pids)
> +{
> +       int rc;
> +       struct clone_struct kcs;
> +       unsigned long clone_flags;
> +       int __user *parent_tid_ptr;
> +       int __user *child_tid_ptr;
> +       unsigned long child_stack_base;
> +       struct pt_regs *regs;
> +
> +       rc = copy_from_user(&kcs, ucs, sizeof(kcs));
> +       if (rc)
> +               return -EFAULT;
> +
> +       /*
> +        * TODO: Convert clone_flags to 64-bit
> +        */
> +       clone_flags = (unsigned long)kcs.flags;
> +       child_stack_base = (unsigned long)kcs.child_stack;
> +       parent_tid_ptr = (int __user *)&ucs->parent_tid;
> +       child_tid_ptr = (int __user *)&ucs->child_tid;
> +       regs = task_pt_regs(current);
> +
> +       if (!child_stack_base)
> +               child_stack_base = user_stack_pointer(regs);
> +
> +       return do_fork_with_pids(clone_flags, child_stack_base, regs, 0,
> +                       parent_tid_ptr, child_tid_ptr, kcs.nr_pids, pids);
> +}
> +

The function looks ok, but you have put it into arch/x86/kernel/process_32.c,
which is specific to x86-32. Since the code in this form is generic, why
not just put it into kernel/fork.c? You should probably enclose it within
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK to make sure that user_stack_pointer()
is implemented, but then it would be immediately usable by the nine architectures
implementing that. The other architectures can then decide between adding
their private version of sys_clone2 with an open-coded user_stack_pointer
implementation or alternatively implement the tracehooks.

	Arnd <><


* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-09-24 20:15   ` Sukadev Bhattiprolu
@ 2009-09-24 22:06     ` Oren Laadan
  2009-09-24 22:21       ` Arnd Bergmann
  0 siblings, 1 reply; 45+ messages in thread
From: Oren Laadan @ 2009-09-24 22:06 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, arnd, Containers, Nathan Lynch, Eric W. Biederman,
	hpa, mingo, torvalds, Alexey Dobriyan, Pavel Emelyanov



Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl@librato.com] wrote:
> | 
> | 
> | Sukadev Bhattiprolu wrote:
> | > Based on these requirements and constraints, we explored a couple of system
> | > call interfaces (in earlier versions of this patchset) and currently define
> | > the system call as:
> | > 
> | > 	struct clone_struct {
> | > 		u64 flags;
> | > 		u64 child_stack;
> | > 		u32 nr_pids;
> | > 		u32 parent_tid;
> | > 		u32 child_tid;
> | 
> | So @parent_tid and @child_tid are pointers to userspace memory and
> | require 'u64' (and it won't hurt to make @reserved1 a 'u64' as well).
> 
> No, as Arnd pointed out, we already pass in a pointer to 'struct clone_struct'
> and the kernel can use that pointer to copy the parent and child tids.

In this form, you place a constraint on where userspace may
place the {parent,child}_tid variables, and require that this
particular clone_struct remain valid memory in the parent until
the child terminates.  This may break existing programs that
use this (thread libraries?).

Oren.

> 
> | 
> | > 		u32 reserved1;
> | > 		u64 reserved2;
> | > 	};
> | > 
> | 
> | Also, for forward/backward compatibility, explicitly state in the
> | documentation, and enforce in the kernel, that flags which are not
> | defined must not be set, and that reserved{1,2} must remain 0.
> 
> Good idea. Will do.
> 
> Thanks,
> 
> Sukadev


* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-09-24 22:06     ` Oren Laadan
@ 2009-09-24 22:21       ` Arnd Bergmann
  2009-09-24 23:19         ` Oren Laadan
  0 siblings, 1 reply; 45+ messages in thread
From: Arnd Bergmann @ 2009-09-24 22:21 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Sukadev Bhattiprolu, linux-kernel, Containers, Nathan Lynch,
	Eric W. Biederman, hpa, mingo, torvalds, Alexey Dobriyan,
	Pavel Emelyanov

On Friday 25 September 2009, Oren Laadan wrote:
> In this form, you place a constraint on where userspace may
> place the {parent,child}_tid variables, and require that this
> particular clone_struct remain valid memory in the parent until
> the child terminates.  This may break existing programs that
> use this (thread libraries?).

No existing program uses sys_clone2, and the kernel function
may well differ from the user space calling conventions, which
are not bound by the six-argument limitation.

So a clone2 library call could set up the structure with the
arguments to the real syscall, call into the kernel and
copy the output data back into the pointers it was given by
the user.
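
Such a wrapper could be sketched roughly as below. Everything here is an
assumption made for illustration: clone2_wrapper() and raw_clone2_fn are
hypothetical names, and fake_raw_clone2() is a stub that stands in for the
real system call so that the example runs without kernel support.

```c
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* User-space mirror of the proposed structure. */
struct clone_struct {
	uint64_t flags;
	uint64_t child_stack;
	uint32_t nr_pids;
	uint32_t parent_tid;
	uint32_t child_tid;
	uint32_t reserved1;
	uint64_t reserved2;
};

/* Hypothetical type of the raw system call entry point. */
typedef long (*raw_clone2_fn)(struct clone_struct *cs, pid_t *pids);

/* Stub "kernel": reports the last requested pid (or 1000) as the child. */
static long fake_raw_clone2(struct clone_struct *cs, pid_t *pids)
{
	pid_t pid = (cs->nr_pids && pids) ? pids[cs->nr_pids - 1] : 1000;

	cs->parent_tid = (uint32_t)pid;
	cs->child_tid = (uint32_t)pid;
	return pid;
}

/*
 * Library-level call: takes separate tid pointers, as clone() callers
 * expect, and hides the structure-based kernel interface behind them.
 */
static long clone2_wrapper(raw_clone2_fn raw, uint64_t flags,
			   void *child_stack, pid_t *parent_tid,
			   pid_t *child_tid, uint32_t nr_pids, pid_t *pids)
{
	struct clone_struct cs;
	long rc;

	memset(&cs, 0, sizeof(cs));	/* reserved fields stay 0 */
	cs.flags = flags;
	cs.child_stack = (uint64_t)(uintptr_t)child_stack;
	cs.nr_pids = nr_pids;

	rc = raw(&cs, pids);
	if (rc > 0) {
		/* Parent: copy the kernel's output back to the caller. */
		if (parent_tid)
			*parent_tid = (pid_t)cs.parent_tid;
		if (child_tid)
			*child_tid = (pid_t)cs.child_tid;
	}
	return rc;
}
```

The sketch only covers values that are written while the call is in
progress; anything the kernel writes later would need a location that
outlives the on-stack structure.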

	Arnd <><


* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-09-24 22:21       ` Arnd Bergmann
@ 2009-09-24 23:19         ` Oren Laadan
  0 siblings, 0 replies; 45+ messages in thread
From: Oren Laadan @ 2009-09-24 23:19 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Sukadev Bhattiprolu, linux-kernel, Containers, Nathan Lynch,
	Eric W. Biederman, hpa, mingo, torvalds, Alexey Dobriyan,
	Pavel Emelyanov



Arnd Bergmann wrote:
> On Friday 25 September 2009, Oren Laadan wrote:
>> In this form, you place constraints on where userspace may
>> place the {parent,child}_tid variable, and require that this
>> particular clone_struct remain valid memory in the parent until
>> the child terminates.  This may break existing programs that
>> use this (threads libraries ?)
> 
> No existing program uses sys_clone2, and the kernel function
> may well differ from the user space calling conventions, which
> are not bound by the six-argument limitation.
> 
> So a clone2 library call could set up the structure with the
> arguments to the real syscall, call into the kernel and
> copy the output data back into the pointers it was given by
> the user.

That may work well for parent_tid, however child_tid is also
kept on the task_struct and written to when the child exits,
and there is no explicit user-space wrapper on that.

Also, I may be mistaken, but I thought that the idea of these
was that the kernel writes them to user space, so other threads
may see them quickly, _before_ the parent returns to userspace;
otherwise, the parent (at the library) could himself copy the
return value.

Oren.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 9/9]: Document clone2() syscall
  2009-09-24 17:03 ` [RFC][v7][PATCH 9/9]: Document " Sukadev Bhattiprolu
  2009-09-24 18:05   ` Randy Dunlap
@ 2009-09-25  2:31   ` KOSAKI Motohiro
  1 sibling, 0 replies; 45+ messages in thread
From: KOSAKI Motohiro @ 2009-09-25  2:31 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: kosaki.motohiro, linux-kernel, Oren Laadan, serue,
	Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov,
	Andrew Morton, torvalds, mikew, mingo, hpa, Nathan Lynch, arnd,
	peterz, Containers, sukadev

> 
> 
> Subject: [RFC][v7][PATCH 9/9]: Document clone2() syscall
> 
> This gives a brief overview of the clone2() system call.  We should
> eventually describe more details in existing clone(2) man page or in
> a new man page.
> 
> Changelog[v7]:
> 	- Rename clone_with_pids() to clone2()
> 	- Changes to reflect new prototype of clone2() (using clone_struct).
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev@vnet.linux.ibm.com>

AFAIK, ia64 already has a clone2() system call, and it is incompatible
with this one.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-24 21:43   ` Arnd Bergmann
@ 2009-09-25  8:23     ` Louis Rilling
  2009-09-25 10:56       ` Louis Rilling
  0 siblings, 1 reply; 45+ messages in thread
From: Louis Rilling @ 2009-09-25  8:23 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Sukadev Bhattiprolu, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, hpa, mingo, torvalds, Alexey Dobriyan,
	Pavel Emelyanov

On Thu, Sep 24, 2009 at 11:43:59PM +0200, Arnd Bergmann wrote:
> On Thursday 24 September 2009, Sukadev Bhattiprolu wrote:
> > +asmlinkage long
> > +sys_clone2(struct clone_struct __user *ucs, pid_t __user *pids)
> > +{
> > +       int rc;
> > +       struct clone_struct kcs;
> > +       unsigned long clone_flags;
> > +       int __user *parent_tid_ptr;
> > +       int __user *child_tid_ptr;
> > +       unsigned long __user child_stack_base;
> > +       struct pt_regs *regs;
> > +
> > +       rc = copy_from_user(&kcs, ucs, sizeof(kcs));
> > +       if (rc)
> > +               return -EFAULT;
> > +
> > +       /*
> > +        * TODO: Convert clone_flags to 64-bit
> > +        */
> > +       clone_flags = (unsigned long)kcs.flags;
> > +       child_stack_base = (unsigned long)kcs.child_stack;
> > +       parent_tid_ptr = &ucs->parent_tid;
> > +       child_tid_ptr =  &ucs->child_tid;
> > +       regs = task_pt_regs(current);
> > +
> > +       if (!child_stack_base)
> > +               child_stack_base = user_stack_pointer(regs);
> > +
> > +       return do_fork_with_pids(clone_flags, child_stack_base, regs, 0,
> > +                       parent_tid_ptr, child_tid_ptr, kcs.nr_pids, pids);
> > +}
> > +
> 
> The function looks ok, but you have put it into arch/x86/kernel/process_32.c,
> which is specific to x86-32. Since the code in this form is generic, why
> not just put it into kernel/fork.c? You should probably enclose it within
> #ifdef CONFIG_HAVE_ARCH_TRACEHOOK to make sure that user_stack_pointer()
> is implemented, but then it would be immediately usable by the nine architectures
> implementing that. The other architectures can then decide between adding
> their private version of sys_clone2 with an open-coded user_stack_pointer
> implementation or alternatively implement the tracehooks.

It will very likely break ia64, which defines CONFIG_HAVE_ARCH_TRACEHOOK and
already has sys_clone2().

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-25  8:23     ` Louis Rilling
@ 2009-09-25 10:56       ` Louis Rilling
  2009-09-29 18:05         ` Sukadev Bhattiprolu
  0 siblings, 1 reply; 45+ messages in thread
From: Louis Rilling @ 2009-09-25 10:56 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Containers, Nathan Lynch, linux-kernel, Eric W. Biederman, hpa,
	mingo, Sukadev Bhattiprolu, torvalds, Alexey Dobriyan,
	Pavel Emelyanov

On 25/09/09 10:23 +0200, Louis Rilling wrote:
> On Thu, Sep 24, 2009 at 11:43:59PM +0200, Arnd Bergmann wrote:
> > On Thursday 24 September 2009, Sukadev Bhattiprolu wrote:
> > > +asmlinkage long
> > > +sys_clone2(struct clone_struct __user *ucs, pid_t __user *pids)
> > > +{
> > > +       int rc;
> > > +       struct clone_struct kcs;
> > > +       unsigned long clone_flags;
> > > +       int __user *parent_tid_ptr;
> > > +       int __user *child_tid_ptr;
> > > +       unsigned long __user child_stack_base;
> > > +       struct pt_regs *regs;
> > > +
> > > +       rc = copy_from_user(&kcs, ucs, sizeof(kcs));
> > > +       if (rc)
> > > +               return -EFAULT;
> > > +
> > > +       /*
> > > +        * TODO: Convert clone_flags to 64-bit
> > > +        */
> > > +       clone_flags = (unsigned long)kcs.flags;
> > > +       child_stack_base = (unsigned long)kcs.child_stack;
> > > +       parent_tid_ptr = &ucs->parent_tid;
> > > +       child_tid_ptr =  &ucs->child_tid;
> > > +       regs = task_pt_regs(current);
> > > +
> > > +       if (!child_stack_base)
> > > +               child_stack_base = user_stack_pointer(regs);
> > > +
> > > +       return do_fork_with_pids(clone_flags, child_stack_base, regs, 0,
> > > +                       parent_tid_ptr, child_tid_ptr, kcs.nr_pids, pids);
> > > +}
> > > +
> > 
> > The function looks ok, but you have put it into arch/x86/kernel/process_32.c,
> > which is specific to x86-32. Since the code in this form is generic, why
> > not just put it into kernel/fork.c? You should probably enclose it within
> > #ifdef CONFIG_HAVE_ARCH_TRACEHOOK to make sure that user_stack_pointer()
> > is implemented, but then it would be immediately usable by the nine architectures
> > implementing that. The other architectures can then decide between adding
> > their private version of sys_clone2 with an open-coded user_stack_pointer
> > implementation or alternatively implement the tracehooks.
> 
> It will very likely break ia64, which defines CONFIG_HAVE_ARCH_TRACEHOOK and
> already has sys_clone2().

-> sys_clone_ext() ?

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-25 10:56       ` Louis Rilling
@ 2009-09-29 18:05         ` Sukadev Bhattiprolu
  2009-09-29 18:40           ` Roland McGrath
  2009-09-29 21:58           ` Oren Laadan
  0 siblings, 2 replies; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-09-29 18:05 UTC (permalink / raw)
  To: Arnd Bergmann, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, hpa, mingo, torvalds, Alexey Dobriyan,
	Pavel Emelyanov
  Cc: linux-api, kosaki.motohiro

Ccing kosaki.motohiro@jp.fujitsu.com and linux-api on this thread.

Louis Rilling [Louis.Rilling@kerlabs.com] wrote:
| > It will very likely break ia64, which defines CONFIG_HAVE_ARCH_TRACEHOOK and
| > already has sys_clone2().
| 
| -> sys_clone_ext() ?
| 
| Louis

How about spelling out extended and calling it clone_extended() ?

The other options I can think of are clone_with_pids() and clone3().

Thanks for your feedback.

Sukadev

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-29 18:05         ` Sukadev Bhattiprolu
@ 2009-09-29 18:40           ` Roland McGrath
  2009-09-29 18:44             ` H. Peter Anvin
  2009-09-29 21:58           ` Oren Laadan
  1 sibling, 1 reply; 45+ messages in thread
From: Roland McGrath @ 2009-09-29 18:40 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Arnd Bergmann, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, hpa, mingo, torvalds, Alexey Dobriyan,
	Pavel Emelyanov, linux-api, kosaki.motohiro

Why add a new syscall at all instead of just using a new CLONE_* flag to
indicate that the argument layout is different?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-29 18:40           ` Roland McGrath
@ 2009-09-29 18:44             ` H. Peter Anvin
  2009-09-29 19:02               ` Arjan van de Ven
  0 siblings, 1 reply; 45+ messages in thread
From: H. Peter Anvin @ 2009-09-29 18:44 UTC (permalink / raw)
  To: Roland McGrath
  Cc: Sukadev Bhattiprolu, Arnd Bergmann, Containers, Nathan Lynch,
	linux-kernel, Eric W. Biederman, mingo, torvalds, Alexey Dobriyan,
	Pavel Emelyanov, linux-api, kosaki.motohiro

On 09/29/2009 11:40 AM, Roland McGrath wrote:
> Why add a new syscall at all instead of just using a new CLONE_* flag to
> indicate that the argument layout is different?

What an absolutely atrociously bad idea.

We already have a syscall layer which is painful to thunk in places, and
this would make it much worse.

	-hpa


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-29 18:44             ` H. Peter Anvin
@ 2009-09-29 19:02               ` Arjan van de Ven
  2009-09-29 19:10                 ` Linus Torvalds
  2009-09-29 20:00                 ` H. Peter Anvin
  0 siblings, 2 replies; 45+ messages in thread
From: Arjan van de Ven @ 2009-09-29 19:02 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Roland McGrath, Sukadev Bhattiprolu, Arnd Bergmann, Containers,
	Nathan Lynch, linux-kernel, Eric W. Biederman, mingo, torvalds,
	Alexey Dobriyan, Pavel Emelyanov, linux-api, kosaki.motohiro

On Tue, 29 Sep 2009 11:44:52 -0700
"H. Peter Anvin" <hpa@zytor.com> wrote:

> On 09/29/2009 11:40 AM, Roland McGrath wrote:
> > Why add a new syscall at all instead of just using a new CLONE_*
> > flag to indicate that the argument layout is different?
> 
> What an absolutely atrociously bad idea.
> 
> We already have a syscall layer which is painful to thunk in places,
> and this would make it much worse.
> 

syscalls are cheap as well.
cheaper than decades of dealing with such multiplexer mess ;/


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-29 19:02               ` Arjan van de Ven
@ 2009-09-29 19:10                 ` Linus Torvalds
  2009-09-29 20:02                   ` H. Peter Anvin
  2009-09-29 20:00                 ` H. Peter Anvin
  1 sibling, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2009-09-29 19:10 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: H. Peter Anvin, Roland McGrath, Sukadev Bhattiprolu,
	Arnd Bergmann, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, mingo, Alexey Dobriyan, Pavel Emelyanov,
	linux-api, kosaki.motohiro



On Tue, 29 Sep 2009, Arjan van de Ven wrote:
> > 
> > We already have a syscall layer which is painful to thunk in places,
> > and this would make it much worse.
> 
> syscalls are cheap as well.
> cheaper than decades of dealing with such multiplexer mess ;/

Well, I'd agree, except the clone flags really _are_ about multiplexer 
issues, and the new flag wouldn't really change anything.

If the new system call actually had appreciably separate code-paths, I'd 
buy the "multiplexer" argument. But it doesn't really. It's going to call 
down to the same basic clone functionality, and the core clone code ends 
up de-multiplexing the cases anyway.

So this would not at all be like the socket calls (to pick the traditional 
Linux system call multiplexing example) in that sense.

			Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-29 19:02               ` Arjan van de Ven
  2009-09-29 19:10                 ` Linus Torvalds
@ 2009-09-29 20:00                 ` H. Peter Anvin
  1 sibling, 0 replies; 45+ messages in thread
From: H. Peter Anvin @ 2009-09-29 20:00 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Roland McGrath, Sukadev Bhattiprolu, Arnd Bergmann, Containers,
	Nathan Lynch, linux-kernel, Eric W. Biederman, mingo, torvalds,
	Alexey Dobriyan, Pavel Emelyanov, linux-api, kosaki.motohiro

On 09/29/2009 12:02 PM, Arjan van de Ven wrote:
> On Tue, 29 Sep 2009 11:44:52 -0700
> "H. Peter Anvin" <hpa@zytor.com> wrote:
> 
>> On 09/29/2009 11:40 AM, Roland McGrath wrote:
>>> Why add a new syscall at all instead of just using a new CLONE_*
>>> flag to indicate that the argument layout is different?
>>
>> What an absolutely atrociously bad idea.
>>
>> We already have a syscall layer which is painful to thunk in places,
>> and this would make it much worse.
>>
> syscalls are cheap as well.
> cheaper than decades of dealing with such multiplexer mess ;/
> 

It really comes down to wanting all the dispatch to happen in one
central place.

	-hpa


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-29 19:10                 ` Linus Torvalds
@ 2009-09-29 20:02                   ` H. Peter Anvin
  2009-09-29 22:11                     ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: H. Peter Anvin @ 2009-09-29 20:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Roland McGrath, Sukadev Bhattiprolu,
	Arnd Bergmann, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, mingo, Alexey Dobriyan, Pavel Emelyanov,
	linux-api, kosaki.motohiro

On 09/29/2009 12:10 PM, Linus Torvalds wrote:
> 
> On Tue, 29 Sep 2009, Arjan van de Ven wrote:
>>>
>>> We already have a syscall layer which is painful to thunk in places,
>>> and this would make it much worse.
>>
>> syscalls are cheap as well.
>> cheaper than decades of dealing with such multiplexer mess ;/
> 
> Well, I'd agree, except the clone flags really _are_ about multiplexer 
> issues, and the new flag wouldn't really change anything.
> 
> If the new system call actually had appreciably separate code-paths, I'd 
> buy the "multiplexer" argument. But it doesn't really. It's going to call 
> down to the same basic clone functionality, and the core clone code ends 
> up de-multiplexing the cases anyway.
> 
> So this would not at all be like the socket calls (to pick the traditional 
> Linux system call multiplexing example) in that sense.
> 

That's not the main issue here, though.  The main issue is that the
prototype of the function now depends on one of its arguments, which is
absolute hell for anything that needs to thunk arguments in a systematic
way, which we have to do on several architectures, and which would be
useful to be able to do for others, too.
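
To make the thunking concern concrete, here is a small userspace sketch. It
is an illustration under stated assumptions, not code from any architecture:
SKETCH_CLONE_EXT, the two-argument layout and all function names are
hypothetical. With a fixed prototype, a static per-argument table is enough
to widen 32-bit registers; with a flag-dependent layout, the thunk has to
decode the flags before it knows how to treat the next argument.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CLONE_EXT 0x1u	/* hypothetical "layout is different" flag */

/* How one raw 32-bit argument register must be widened to 64 bits. */
enum arg_kind { ARG_INT, ARG_PTR };

/* Fixed prototype: one static description serves a generic thunk. */
static const enum arg_kind fixed_layout[2] = { ARG_INT, ARG_PTR };

static uint64_t widen(uint32_t raw, enum arg_kind kind)
{
	/* Integers are sign-extended, pointers are zero-extended. */
	if (kind == ARG_INT && (raw & 0x80000000u))
		return (uint64_t)raw | 0xFFFFFFFF00000000ull;
	return (uint64_t)raw;
}

/* Systematic thunk: needs no knowledge of what the call means. */
void thunk_fixed(const uint32_t raw[2], uint64_t out[2])
{
	assert(raw && out);
	for (int i = 0; i < 2; i++)
		out[i] = widen(raw[i], fixed_layout[i]);
}

/* Flag-dependent layout: argument 1 changes type based on argument 0. */
void thunk_flag_dependent(const uint32_t raw[2], uint64_t out[2])
{
	assert(raw && out);
	out[0] = widen(raw[0], ARG_INT);
	if (raw[0] & SKETCH_CLONE_EXT)
		out[1] = widen(raw[1], ARG_PTR);	/* pointer to a struct */
	else
		out[1] = widen(raw[1], ARG_INT);	/* plain integer */
}
```

The second thunk cannot be generated from a table alone, which is the kind
of per-call special case being objected to here.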

	-hpa


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-29 18:05         ` Sukadev Bhattiprolu
  2009-09-29 18:40           ` Roland McGrath
@ 2009-09-29 21:58           ` Oren Laadan
  1 sibling, 0 replies; 45+ messages in thread
From: Oren Laadan @ 2009-09-29 21:58 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Arnd Bergmann, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, hpa, mingo, torvalds, Alexey Dobriyan,
	Pavel Emelyanov, linux-api, kosaki.motohiro



Sukadev Bhattiprolu wrote:
> Ccing kosaki.motohiro@jp.fujitsu.com and linux-api on this thread.
> 
> Louis Rilling [Louis.Rilling@kerlabs.com] wrote:
> | > It will very likely break ia64, which defines CONFIG_HAVE_ARCH_TRACEHOOK and
> | > already has sys_clone2().
> | 
> | -> sys_clone_ext() ?
> | 
> | Louis
> 
> How about spelling out extended and calling it clone_extended() ?
> 
> The other options I can think of are clone_with_pids() and clone3().

I like clone3(), or clone_new() ?

or even better -- how about xerox()  :p

Oren.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-29 20:02                   ` H. Peter Anvin
@ 2009-09-29 22:11                     ` Linus Torvalds
  2009-09-29 22:19                       ` H. Peter Anvin
  2009-09-30  6:48                       ` Roland McGrath
  0 siblings, 2 replies; 45+ messages in thread
From: Linus Torvalds @ 2009-09-29 22:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Arjan van de Ven, Roland McGrath, Sukadev Bhattiprolu,
	Arnd Bergmann, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, mingo, Alexey Dobriyan, Pavel Emelyanov,
	linux-api, kosaki.motohiro



On Tue, 29 Sep 2009, H. Peter Anvin wrote:
> 
> That's not the main issue here, though.  The main issue is that the
> prototype of the function now depends on one of its arguments

Ok, I agree with that. The kernel side is easy (we have magic calling 
conventions there and need to turn registers into arguments anyway before 
you get to the shared code), but your point about the user side prototype 
is valid.

However, that could easily be handled by just having an extended_clone() 
prototype that then sets the CLONE_EXTINFO (or whatever) bit in the flags. 
I think most of the time the clone() stuff needs special user-level 
wrappers anyway to handle the stack setup etc, no?

In other words, what I'd suggest we could do is

 - the kernel "do_fork()" interface would be made to have the "extended" 
   format by default - so the _kernel_ never has two formats in its 
   generic logic.

 - the "sys_clone()" system call, that already needs to munge the user 
   mode registers into the "do_fork()" format, would be the one that 
   recognizes the new flag and copies the extended data from user mode 
   memory to the extended info mode.

Then each architecture would need to update its "sys_clone()" function to 
take advantage of the new extended format, but that's something that the 
new system call would have had to do anyway, so that's not an added burden 
in any way.
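
The split described above could look roughly like the following userspace
sketch. All names are assumptions made for illustration
(SKETCH_CLONE_EXTINFO, struct ext_info_sketch, the *_sketch functions), the
fixed pids[4] array is arbitrary, and memcpy() merely stands in for
copy_from_user(). The point is only that the generic code sees one format
and the entry point is the single place that knows about two layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_CLONE_EXTINFO 0x80000000ul	/* hypothetical flag bit */

/* Hypothetical "extended" format; the generic code only ever sees this. */
struct ext_info_sketch {
	unsigned long flags;
	unsigned long stack;
	int nr_pids;
	int pids[4];
};

/* Stand-in for do_fork(): takes the extended format unconditionally. */
static int do_fork_sketch(const struct ext_info_sketch *ext)
{
	assert(ext != NULL);
	/* Report the requested pid if one was given, else a placeholder. */
	return ext->nr_pids > 0 ? ext->pids[0] : 1000;
}

/* Stand-in for sys_clone(): normalizes both layouts into the extended one. */
int sys_clone_sketch(unsigned long flags, unsigned long stack,
		     const struct ext_info_sketch *user_ext)
{
	struct ext_info_sketch ext;

	memset(&ext, 0, sizeof(ext));	/* old-style call: no extra info */

	if (flags & SKETCH_CLONE_EXTINFO) {
		if (!user_ext)
			return -1;
		memcpy(&ext, user_ext, sizeof(ext));
		if (ext.nr_pids < 0 || ext.nr_pids > 4)
			return -1;
	}

	/* The marker bit is consumed here and never reaches generic code. */
	ext.flags = flags & ~SKETCH_CLONE_EXTINFO;
	ext.stack = stack;
	return do_fork_sketch(&ext);
}
```

An old-style caller passes no flag and gets an all-zero extended block; a
new-style caller sets the flag and supplies the extra data.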

Hmm?

I don't feel horribly strongly about this, and as far as I'm concerned 
it's fine to also do it as a new system call too (we already have 'fork()' 
and 'vfork()' as special case interfaces to do_fork() - the new 'extended 
clone' would be no different).

I just think that Roland is correct that if the new extended fork handles 
the "no new info" case itself _anyway_, then there is no upside to making 
it a new system call, since the complexity is the same as just extending 
the old one.

			Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-29 22:11                     ` Linus Torvalds
@ 2009-09-29 22:19                       ` H. Peter Anvin
  2009-09-30 16:15                         ` Arnd Bergmann
  2009-09-30  6:48                       ` Roland McGrath
  1 sibling, 1 reply; 45+ messages in thread
From: H. Peter Anvin @ 2009-09-29 22:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Roland McGrath, Sukadev Bhattiprolu,
	Arnd Bergmann, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, mingo, Alexey Dobriyan, Pavel Emelyanov,
	linux-api, kosaki.motohiro

On 09/29/2009 03:11 PM, Linus Torvalds wrote:
> 
> Ok, I agree with that. The kernel side is easy (we have magic calling 
> conventions there and need to turn registers into arguments anyway before 
> you get to the shared code), but your point about the user side prototype 
> is valid.
> 

I think it would also apply to kernel-side munging.  It's quite possible
you're right in that clone is such a special case anyway, but it seems
pointless to make it more special in the short bus sort of way even if
it is possible.

Let's just make it another system call.  It doesn't have any downside
that I can see, might prevent problems, and avoids setting a bad
precedent that someone can misinterpret.

	-hpa


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-09-24 18:35   ` Serge E. Hallyn
@ 2009-09-30  5:34     ` Alexey Dobriyan
  2009-09-30 17:41       ` Oren Laadan
  0 siblings, 1 reply; 45+ messages in thread
From: Alexey Dobriyan @ 2009-09-30  5:34 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Sukadev Bhattiprolu, linux-kernel, Oren Laadan, Eric W. Biederman,
	Pavel Emelyanov, Andrew Morton, torvalds, mikew, mingo, hpa,
	Nathan Lynch, arnd, peterz, Containers, sukadev

On Thu, Sep 24, 2009 at 01:35:56PM -0500, Serge E. Hallyn wrote:
> Quoting Alexey Dobriyan (adobriyan@gmail.com):
> > I don't like this even more.
> > 
> > Pid namespaces are hierarchical _and_ anonymous, so simply a
> > set of numbers doesn't describe the final object.
> > 
> > struct pid isn't special, it's just another invariant if you like
> > as far as C/R is concerned, but system call is made special wrt pids.
> > 
> > What will be in an image? I hope "struct kstate_image_pid" with several
> 
> Sure pid namespaces are anonymous, but we will give each an 'objref'
> valid only for a checkpoint image, and store the relationship between
> pid namespaces based on those objrefs.  Basically the same way that user
> structs and hierarchical user namespaces are handled right now.

OK, that's certainly doable.

You're committing yourself to creating tasks in userspace if this goes in. :-\
Which can lead you into putting the wrong kind of relations into the image.
IIRC, clone_flags were in the image (still?), but tomorrow the kernel will get
a new way to acquire, say, uts_ns, which, in theory, can't be described by
a set of consecutive clones, so you'll have to fix up something in the kernel.

> > numbers and with references to such object from other places, so it
> > seems natural to do alloc_pid() with needed numbers and that attach new
> > shiny pid to where needed. But this clone_pid is only for task_struct's pids.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-29 22:11                     ` Linus Torvalds
  2009-09-29 22:19                       ` H. Peter Anvin
@ 2009-09-30  6:48                       ` Roland McGrath
  1 sibling, 0 replies; 45+ messages in thread
From: Roland McGrath @ 2009-09-30  6:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Arjan van de Ven, Sukadev Bhattiprolu,
	Arnd Bergmann, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, mingo, Alexey Dobriyan, Pavel Emelyanov,
	linux-api, kosaki.motohiro

The glibc prototype for clone uses ... and is not a direct map to the
syscall args anyway.  So that would not change for adding more optional
args enabled by certain flags, as it did not change to add the tid pointer
arguments before.  But indeed the library function would have to change to
pass on additional or different args to the existing syscall.
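
For illustration, a simplified sketch of how a variadic wrapper can pick up
optional arguments depending on the flags. The flag values and names below
(SK_PARENT_SETTID, SK_CHILD_SETTID, clone_like) are assumptions for this
example only; they are not the kernel's values and this is not glibc's
implementation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch only. */
#define SK_PARENT_SETTID 0x1
#define SK_CHILD_SETTID  0x2

struct collected_args {
	int *parent_tid;
	int *child_tid;
};

/*
 * Variadic wrapper: the prototype stays the same, and each optional
 * pointer is read from the argument list only when its flag is set.
 */
struct collected_args clone_like(int flags, ...)
{
	struct collected_args out = { NULL, NULL };
	va_list ap;

	/* Unknown flags are treated as a caller bug in this sketch. */
	assert((flags & ~(SK_PARENT_SETTID | SK_CHILD_SETTID)) == 0);

	va_start(ap, flags);
	if (flags & SK_PARENT_SETTID)
		out.parent_tid = va_arg(ap, int *);
	if (flags & SK_CHILD_SETTID)
		out.child_tid = va_arg(ap, int *);
	va_end(ap);

	return out;
}
```

Adding another flag-gated argument changes the body of such a wrapper but
not its declaration.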


Thanks,
Roland

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-29 22:19                       ` H. Peter Anvin
@ 2009-09-30 16:15                         ` Arnd Bergmann
  2009-09-30 16:27                           ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Arnd Bergmann @ 2009-09-30 16:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Arjan van de Ven, Roland McGrath,
	Sukadev Bhattiprolu, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, mingo, Alexey Dobriyan, Pavel Emelyanov,
	linux-api, kosaki.motohiro

On Wednesday 30 September 2009, H. Peter Anvin wrote:
> Let's just make it another system call.  It doesn't have any downside
> that I can see, might prevent problems, and avoids setting a bad
> precedent that someone can misinterpret.

One more argument for this is that the new code is architecture independent
using user_stack_pointer(), while the original sys_clone is highly
architecture specific, which is a source for bugs when trying to
extend it.

	Arnd <><

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-30 16:15                         ` Arnd Bergmann
@ 2009-09-30 16:27                           ` Linus Torvalds
  2009-09-30 17:59                             ` Arnd Bergmann
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2009-09-30 16:27 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: H. Peter Anvin, Arjan van de Ven, Roland McGrath,
	Sukadev Bhattiprolu, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, mingo, Alexey Dobriyan, Pavel Emelyanov,
	linux-api, kosaki.motohiro



On Wed, 30 Sep 2009, Arnd Bergmann wrote:
> 
> One more argument for this is that the new code is architecture independent
> using user_stack_pointer(), while the original sys_clone is highly
> architecture specific, which is a source for bugs when trying to
> extend it.

Umm. I don't think that is possible.

You need architecture-specific code to even get access to all registers to 
copy and get a signal-handler-compatible stack frame. See for example 
arch/alpha/kernel/entry.S with the switch-stack thing etc.  I don't think 
there is any way to make that even remotely architecture-neutral.

			Linus

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-09-30  5:34     ` Alexey Dobriyan
@ 2009-09-30 17:41       ` Oren Laadan
  2009-10-02 20:27         ` Alexey Dobriyan
  0 siblings, 1 reply; 45+ messages in thread
From: Oren Laadan @ 2009-09-30 17:41 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Serge E. Hallyn, arnd, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, hpa, mingo, Sukadev Bhattiprolu, torvalds,
	Pavel Emelyanov



Alexey Dobriyan wrote:
> On Thu, Sep 24, 2009 at 01:35:56PM -0500, Serge E. Hallyn wrote:
>> Quoting Alexey Dobriyan (adobriyan@gmail.com):
>>> I don't like this even more.
>>>
>>> Pid namespaces are hierarchical _and_ anonymous, so simply a
>>> set of numbers doesn't describe the final object.
>>>
>>> struct pid isn't special, it's just another invariant if you like
>>> as far as C/R is concerned, but system call is made special wrt pids.
>>>
>>> What will be in an image? I hope "struct kstate_image_pid" with several
>> Sure pid namespaces are anonymous, but we will give each an 'objref'
>> valid only for a checkpoint image, and store the relationship between
>> pid namespaces based on those objrefs.  Basically the same way that user
>> structs and hierarchical user namespaces are handled right now.
> 
> OK, that's certainly doable.
> 
> You're committing yourself to creating tasks in userspace if this goes in. :-\
> Which can lead you into putting the wrong kind of relations into the image.

A malicious user can put the "wrong" kind of relations into the image,
regardless of whether the tasks are created in the kernel or in
userspace. As long as the creation follows the "instructions" in
the image, the result would be the same.

> IIRC, clone_flags were in the image (still?), but tomorrow the kernel will get
> a new way to acquire, say, uts_ns, which, in theory, can't be described by
> a set of consecutive clones, so you'll have to fix up something in the kernel.

The only thing enforced in user space is task relations, threads
and (as a by-product) session id's.  The rest are refined in the
kernel. This includes uts_ns, for example.

(FWIW, _any_ clone relationships can be described by a set of
clones. In particular because that's how they were constructed
in the first place).

Oren.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-30 16:27                           ` Linus Torvalds
@ 2009-09-30 17:59                             ` Arnd Bergmann
  2009-09-30 19:14                               ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Arnd Bergmann @ 2009-09-30 17:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Arjan van de Ven, Roland McGrath,
	Sukadev Bhattiprolu, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, mingo, Alexey Dobriyan, Pavel Emelyanov,
	linux-api, kosaki.motohiro

On Wednesday 30 September 2009, Linus Torvalds wrote:
> Umm. I don't think that is possible.
> 
> You need architecture-specific code to even get access to all registers to 
> copy and get a signal-handler-compatible stack frame. See for example 
> arch/alpha/kernel/entry.S with the switch-stack thing etc.  I don't think 
> there is any way to make that even remotely architecture-neutral.

Right, you still need to save all the registers from the entry code.
I was under the wrong assumption that task_pt_regs(current)
would give the full register set on all architectures.

However, I'd still hope that a new system call can be defined in
a way that you only need to have an assembly wrapper to save
the full pt_regs, but no arch specific code to get the syscall arguments
out of that again. In do_clone(), you need a pointer to pt_regs and
the user stack pointer, but that can be generated from
user_stack_pointer(regs).

Does task_pt_regs(current) give the right pointer on all architectures
or do we also need to pass the regs into the syscall?

	Arnd <><

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC][v7][PATCH 8/9]: Define clone2() syscall
  2009-09-30 17:59                             ` Arnd Bergmann
@ 2009-09-30 19:14                               ` Linus Torvalds
  0 siblings, 0 replies; 45+ messages in thread
From: Linus Torvalds @ 2009-09-30 19:14 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: H. Peter Anvin, Arjan van de Ven, Roland McGrath,
	Sukadev Bhattiprolu, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, mingo, Alexey Dobriyan, Pavel Emelyanov,
	linux-api, kosaki.motohiro



On Wed, 30 Sep 2009, Arnd Bergmann wrote:
> 
> Right, you still need to save all the registers from the entry code.
> I was under the wrong assumption that task_pt_regs(current)
> would give the full register set on all architectures.
> 
> However, I'd still hope that a new system call can be defined in
> a way that you only need to have an assembly wrapper to save
> the full pt_regs, but no arch specific code to get the syscall arguments
> out of that again. In do_clone(), you need a pointer to pt_regs and
> the user stack pointer, but that can be generated from
> user_stack_pointer(regs).

I don't think it can. You don't know what the system call stack layout is. 

> Does task_pt_regs(current) give the right pointer on all architectures
> or do we also need to pass the regs into the syscall?

I do not believe that it gives the right pointer in general. In fact, I 
can guarantee it doesn't. Even on x86 it only works for certain contexts 
(non-vm86 mode at a minimum), and on architectures like alpha it's not at 
all sufficient, because even if you can locate the 'pt_regs' structure, 
you _also_ need the extra guarantees of the pt_regs being next to the 
extended signal state register structure - and that only happens for magic 
sequences like signal handling and explicit setups like fork/clone.

So I do repeat: if you think you can do all of this in generic code, then 
you're sadly and totally mistaken. Don't even try. It may work on some 
architectures, but it's simply fundamentally _wrong_.

		Linus


* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-09-24 17:44 ` [RFC][v7][PATCH 0/9] Implement clone2() system call Oren Laadan
  2009-09-24 20:15   ` Sukadev Bhattiprolu
@ 2009-10-01  2:36   ` Sukadev Bhattiprolu
  2009-10-01 15:19     ` Oren Laadan
  1 sibling, 1 reply; 45+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-01  2:36 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-kernel, arnd, Containers, Nathan Lynch, Eric W. Biederman,
	hpa, mingo, torvalds, Alexey Dobriyan, Pavel Emelyanov

Oren Laadan [orenl@librato.com] wrote:
| 
| 
| Sukadev Bhattiprolu wrote:
| > === NEW CLONE() SYSTEM CALL:
| > 
| > To support application checkpoint/restart, a task must have the same pid it
| > had when it was checkpointed.  When containers are nested, the tasks within
| > the containers exist in multiple pid namespaces and hence have multiple pids
| > to specify during restart.
| > 
| > This patchset implements a new system call, clone2() that lets a process
| > specify the pids of the child process.
| > 
| > Patches 1 through 6 are helper patches, needed for choosing a pid for the
| > child process.
| > 
| > Patch 8 defines a prototype of the new system call. Patch 9 adds some
| > documentation on the new system call, some/all of which will eventually
| > go into a man page.
| > 
| 
| [...]
| 
| > 
| > Based on these requirements and constraints, we explored a couple of system
| > call interfaces (in earlier versions of this patchset) and currently define
| > the system call as:
| > 
| > 	struct clone_struct {
| > 		u64 flags;
| > 		u64 child_stack;
| > 		u32 nr_pids;
| > 		u32 parent_tid;
| > 		u32 child_tid;
| 
| So @parent_tid and @child_tid are pointers to userspace memory and
| require 'u64' (and it won't hurt to make @reserved1 a 'u64' as well).

Well, if we make parent_tid and child_tid u64, we could move reserved1
after ->nr_pids and leave it as a 32-bit value.
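
A minimal sketch of that layout, assuming the field order suggested here (the struct name clone_struct_sketch and the helper clone_struct_reserved_ok() are hypothetical and not part of the patchset). Widening the two tid fields to 64 bits and moving reserved1 next to nr_pids keeps every 64-bit member naturally aligned, so the struct should have the same size with no implicit padding on 32-bit and 64-bit ABIs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical revision of struct clone_struct, for illustration only.
 * parent_tid and child_tid carry user-space pointers as 64-bit values;
 * reserved1 fills the 32-bit hole after nr_pids.
 */
struct clone_struct_sketch {
	uint64_t flags;
	uint64_t child_stack;
	uint32_t nr_pids;
	uint32_t reserved1;
	uint64_t parent_tid;
	uint64_t child_tid;
	uint64_t reserved2;
};

/* 8 + 8 + 4 + 4 + 8 + 8 + 8 bytes, with no padding between members. */
_Static_assert(sizeof(struct clone_struct_sketch) == 48,
	       "unexpected padding in clone_struct_sketch");
_Static_assert(offsetof(struct clone_struct_sketch, parent_tid) == 24,
	       "parent_tid is expected at offset 24");

/* Returns nonzero when both reserved fields are still zero. */
static int clone_struct_reserved_ok(const struct clone_struct_sketch *cs)
{
	return cs->reserved1 == 0 && cs->reserved2 == 0;
}
```

The reserved-field helper only shows the kind of check being discussed; where such a check would live in the kernel is left open.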

| 
| > 		u32 reserved1;
| > 		u64 reserved2;
| > 	};
| > 
| 
| Also, for forward/backward compatibility, explicitly state in the
| documentation, and enforce in the kernel, that flags which are not
| defined must not be set, and that reserved{1,2} must remain 0.

Agree with checking for reserved1 and reserved2.

We currently don't check for invalid clone_flags - we just ignore them.
Adding checks like

	if (fls(kcs.flags) > fls(CLONE_LAST_FLAG))

would assume we always use bits in order (while it seems to make sense to
use them in order, we don't seem to have done so in the past).

Alternatively we could define a CLONE_FLAG_MASK of valid flags and update
the mask when each new clone flag is added. 

But do we really need to check for invalid flags?

Sukadev


* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-10-01  2:36   ` Sukadev Bhattiprolu
@ 2009-10-01 15:19     ` Oren Laadan
  0 siblings, 0 replies; 45+ messages in thread
From: Oren Laadan @ 2009-10-01 15:19 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: linux-kernel, arnd, Containers, Nathan Lynch, Eric W. Biederman,
	hpa, mingo, torvalds, Alexey Dobriyan, Pavel Emelyanov



Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl@librato.com] wrote:
> | 
> | 
> | Sukadev Bhattiprolu wrote:
> | > === NEW CLONE() SYSTEM CALL:
> | > 
> | > To support application checkpoint/restart, a task must have the same pid it
> | > had when it was checkpointed.  When containers are nested, the tasks within
> | > the containers exist in multiple pid namespaces and hence have multiple pids
> | > to specify during restart.
> | > 
> | > This patchset implements a new system call, clone2() that lets a process
> | > specify the pids of the child process.
> | > 
> | > Patches 1 through 6 are helper patches, needed for choosing a pid for the
> | > child process.
> | > 
> | > Patch 8 defines a prototype of the new system call. Patch 9 adds some
> | > documentation on the new system call, some/all of which will eventually
> | > go into a man page.
> | > 
> | 
> | [...]
> | 
> | > 
> | > Based on these requirements and constraints, we explored a couple of system
> | > call interfaces (in earlier versions of this patchset) and currently define
> | > the system call as:
> | > 
> | > 	struct clone_struct {
> | > 		u64 flags;
> | > 		u64 child_stack;
> | > 		u32 nr_pids;
> | > 		u32 parent_tid;
> | > 		u32 child_tid;
> | 
> | So @parent_tid and @child_tid are pointers to userspace memory and
> | require 'u64' (and it won't hurt to make @reserved1 a 'u64' as well).
> 
> Well, if we make parent_tid and child_tid u64, we could move reserved1
> after ->nr_pids and leave it as a 32-bit value.

Sure. In any case, won't hurt to leave large reserved space -
someone may be thankful for it in the future ;)

> 
> | 
> | > 		u32 reserved1;
> | > 		u64 reserved2;
> | > 	};
> | > 
> | 
> | Also, for forward/backward compatibility, explicitly state in the
> | documentation, and enforce in the kernel, that flags which are not
> | defined must not be set, and that reserved{1,2} must remain 0.
> 
> Agree with checking for reserved1 and reserved2.
> 
> We currently don't check for invalid clone_flags - we just ignore them.
> Adding checks like
> 
> 	if (fls(kcs.flags) > fls(CLONE_LAST_FLAG))
> 
> would assume we always use bits in order (while it seems to make sense to
> use them in order, we don't seem to have done so in the past).
> 
> Alternatively we could define a CLONE_FLAG_MASK of valid flags and update
> the mask when each new clone flag is added. 
> 
> But do we really need to check for invalid flags?

I'd go for a mask.

The idea is that we want to educate userspace to _not_ use unused
flags now. For if userspace sets an unused flag now and we let it
be, the application will break when we give meaning to that flag.
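
A rough sketch of the mask approach, using made-up flag bits (the SKETCH_FLAG_* names, SKETCH_FLAG_MASK and sketch_check_flags() are illustrative assumptions, not the real CLONE_* definitions). The defined bits are deliberately not contiguous, which is the case where comparing against the highest defined bit would wrongly accept an undefined bit that sits in a hole:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative flag bits only; note the gap between bit 9 and bit 31. */
#define SKETCH_FLAG_A	(UINT64_C(1) << 8)
#define SKETCH_FLAG_B	(UINT64_C(1) << 9)
#define SKETCH_FLAG_C	(UINT64_C(1) << 31)

/* Must be extended whenever a new flag is defined. */
#define SKETCH_FLAG_MASK	(SKETCH_FLAG_A | SKETCH_FLAG_B | SKETCH_FLAG_C)

/* Returns 0 if only defined flags are set, -EINVAL otherwise. */
static int sketch_check_flags(uint64_t flags)
{
	if (flags & ~SKETCH_FLAG_MASK)
		return -EINVAL;
	return 0;
}
```

With this, a caller that sets bit 20 (unused, but below the highest defined bit) gets -EINVAL today instead of silently changing behaviour once bit 20 is given a meaning.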

Oren.



* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-09-30 17:41       ` Oren Laadan
@ 2009-10-02 20:27         ` Alexey Dobriyan
  2009-10-02 21:06           ` Oren Laadan
  0 siblings, 1 reply; 45+ messages in thread
From: Alexey Dobriyan @ 2009-10-02 20:27 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Serge E. Hallyn, arnd, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, hpa, mingo, Sukadev Bhattiprolu, torvalds,
	Pavel Emelyanov

On Wed, Sep 30, 2009 at 01:41:45PM -0400, Oren Laadan wrote:
> Alexey Dobriyan wrote:
> > On Thu, Sep 24, 2009 at 01:35:56PM -0500, Serge E. Hallyn wrote:
> >> Quoting Alexey Dobriyan (adobriyan@gmail.com):
> >>> I don't like this even more.
> >>>
> >>> Pid namespaces are hierarchical _and_ anonymous, so simply a
> >>> set of numbers doesn't describe the final object.
> >>>
> >>> struct pid isn't special, it's just another invariant if you like
> >>> as far as C/R is concerned, but system call is made special wrt pids.
> >>>
> >>> What will be in an image? I hope "struct kstate_image_pid" with several
> >> Sure pid namespaces are anonymous, but we will give each an 'objref'
> >> valid only for a checkpoint image, and store the relationship between
> >> pid namespaces based on those objrefs.  Basically the same way that user
> >> structs and hierarchical user namespaces are handled right now.
> > 
> > OK, that's certainly doable.
> > 
> > You're committing yourself to creation of tasks in userspace if this goes in. :-\
> > Which can lead you into putting the wrong kind of relations into the image.
> 
> A malicious user can put the "wrong" kind of relations into the image,
> regardless of whether the tasks are created in the kernel or in
> userspace. As long as the creation follows the "instructions" in
> the image, the result would be the same.

Wrong as in "fundamentally wrong", not malicious.
In case of uts_ns, the correct data to put into image is "task uses this uts_ns",
not "at this point do clone(CLONE_NEWUTS)".

BTW, now I'm convinced that nsproxy should not even be mentioned in an image,
it's an irrelevant technical detail, not future-proof at all.


* Re: [RFC][v7][PATCH 0/9] Implement clone2() system call
  2009-10-02 20:27         ` Alexey Dobriyan
@ 2009-10-02 21:06           ` Oren Laadan
  0 siblings, 0 replies; 45+ messages in thread
From: Oren Laadan @ 2009-10-02 21:06 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Serge E. Hallyn, arnd, Containers, Nathan Lynch, linux-kernel,
	Eric W. Biederman, hpa, mingo, Sukadev Bhattiprolu, torvalds,
	Pavel Emelyanov



Alexey Dobriyan wrote:
> On Wed, Sep 30, 2009 at 01:41:45PM -0400, Oren Laadan wrote:
>> Alexey Dobriyan wrote:
>>> On Thu, Sep 24, 2009 at 01:35:56PM -0500, Serge E. Hallyn wrote:
>>>> Quoting Alexey Dobriyan (adobriyan@gmail.com):
>>>>> I don't like this even more.
>>>>>
>>>>> Pid namespaces are hierarchical _and_ anonymous, so simply a
>>>>> set of numbers doesn't describe the final object.
>>>>>
>>>>> struct pid isn't special, it's just another invariant if you like
>>>>> as far as C/R is concerned, but system call is made special wrt pids.
>>>>>
>>>>> What will be in an image? I hope "struct kstate_image_pid" with several
>>>> Sure pid namespaces are anonymous, but we will give each an 'objref'
>>>> valid only for a checkpoint image, and store the relationship between
>>>> pid namespaces based on those objrefs.  Basically the same way that user
>>>> structs and hierarchical user namespaces are handled right now.
>>> OK, that's certainly doable.
>>>
>>> You're committing yourself to creation of tasks in userspace if this goes in. :-\
>>> Which can lead you into putting the wrong kind of relations into the image.
>> A malicious user can put the "wrong" kind of relations into the image,
>> regardless of whether the tasks are created in the kernel or in
>> userspace. As long as the creation follows the "instructions" in
>> the image, the result would be the same.
> 
> Wrong as in "fundamentally wrong", not malicious.
> In case of uts_ns, the correct data to put into image is "task uses this uts_ns",
> not "at this point do clone(CLONE_NEWUTS)".

So we are in total agreement: that's how it is done now.

Only task creation per-se, including pid-ns (future work) is done
in userspace. Network namespaces will probably be created in userspace
but attached to tasks in the kernel. Remaining namespaces are covered
in the kernel the way you described.
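
To illustrate the "task uses this uts_ns" form (state, not a replay of clone(CLONE_NEWUTS) calls), here is a minimal sketch of what such an image record could look like. The struct sketch_image_task, its fields and sketch_count_uts_ns() are hypothetical names chosen for this example and do not reflect the actual checkpoint image format; the only idea borrowed from the thread is that each namespace gets an objref valid within one image:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical per-task image record: it names the uts namespace the
 * task uses by objref instead of recording how the namespace was created.
 */
struct sketch_image_task {
	int32_t  pid;
	uint32_t uts_ns_ref;	/* objref, meaningful only inside this image */
};

/*
 * Counts the distinct uts_ns objrefs among n task records, i.e. how many
 * uts namespaces a restart would have to instantiate.
 */
static size_t sketch_count_uts_ns(const struct sketch_image_task *tasks,
				  size_t n)
{
	size_t distinct = 0;

	for (size_t i = 0; i < n; i++) {
		size_t j;

		/* Look for an earlier record with the same objref. */
		for (j = 0; j < i; j++)
			if (tasks[j].uts_ns_ref == tasks[i].uts_ns_ref)
				break;
		if (j == i)
			distinct++;
	}
	return distinct;
}
```

Two tasks that carry the same objref share one namespace after restart, regardless of which of them originally created it.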

> 
> BTW, now I'm convinced that nsproxy should not even be mentioned in an image,
> it's an irrelevant technical detail, not future-proof at all.

It's helpful (as in more efficient) to keep it now. We can always
decide to ignore it in the future.

Thanks,

Oren.



Thread overview: 45+ messages
2009-09-24 16:55 [RFC][v7][PATCH 0/9] Implement clone2() system call Sukadev Bhattiprolu
2009-09-24 17:00 ` [RFC][v7][PATCH 1/9]: Factor out code to allocate pidmap page Sukadev Bhattiprolu
2009-09-24 17:00 ` [RFC][v7][PATCH 2/9]: Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
2009-09-24 17:01 ` [RFC][v7][PATCH 3/9] Make pid_max a pid_ns property Sukadev Bhattiprolu
2009-09-24 17:45   ` Oren Laadan
2009-09-24 17:01 ` [RFC][v7][PATCH 4/9]: Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu
2009-09-24 17:47   ` Oren Laadan
2009-09-24 17:01 ` [RFC][v7][PATCH 5/9]: Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
2009-09-24 17:02 ` [RFC][v7][PATCH 6/9]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu
2009-09-24 17:02 ` [RFC][v7][PATCH 7/9]: Define do_fork_with_pids() Sukadev Bhattiprolu
2009-09-24 17:03 ` [RFC][v7][PATCH 8/9]: Define clone2() syscall Sukadev Bhattiprolu
2009-09-24 21:43   ` Arnd Bergmann
2009-09-25  8:23     ` Louis Rilling
2009-09-25 10:56       ` Louis Rilling
2009-09-29 18:05         ` Sukadev Bhattiprolu
2009-09-29 18:40           ` Roland McGrath
2009-09-29 18:44             ` H. Peter Anvin
2009-09-29 19:02               ` Arjan van de Ven
2009-09-29 19:10                 ` Linus Torvalds
2009-09-29 20:02                   ` H. Peter Anvin
2009-09-29 22:11                     ` Linus Torvalds
2009-09-29 22:19                       ` H. Peter Anvin
2009-09-30 16:15                         ` Arnd Bergmann
2009-09-30 16:27                           ` Linus Torvalds
2009-09-30 17:59                             ` Arnd Bergmann
2009-09-30 19:14                               ` Linus Torvalds
2009-09-30  6:48                       ` Roland McGrath
2009-09-29 20:00                 ` H. Peter Anvin
2009-09-29 21:58           ` Oren Laadan
2009-09-24 17:03 ` [RFC][v7][PATCH 9/9]: Document " Sukadev Bhattiprolu
2009-09-24 18:05   ` Randy Dunlap
2009-09-25  2:31   ` KOSAKI Motohiro
2009-09-24 17:44 ` [RFC][v7][PATCH 0/9] Implement clone2() system call Oren Laadan
2009-09-24 20:15   ` Sukadev Bhattiprolu
2009-09-24 22:06     ` Oren Laadan
2009-09-24 22:21       ` Arnd Bergmann
2009-09-24 23:19         ` Oren Laadan
2009-10-01  2:36   ` Sukadev Bhattiprolu
2009-10-01 15:19     ` Oren Laadan
2009-09-24 17:57 ` Alexey Dobriyan
2009-09-24 18:35   ` Serge E. Hallyn
2009-09-30  5:34     ` Alexey Dobriyan
2009-09-30 17:41       ` Oren Laadan
2009-10-02 20:27         ` Alexey Dobriyan
2009-10-02 21:06           ` Oren Laadan
