linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [v9][PATCH 0/9] Implement clone3() system call
@ 2009-10-25  3:35 Sukadev Bhattiprolu
       [not found] ` <20091025033508.GA20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-25  3:35 UTC (permalink / raw)
  Cc: Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	matthltc-r/Jw6+rmf7HQT0dZR+AlfA, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA


Well, there is been no consensus on the name for the new system call.
Michael Kerrisk did a lot of research on how we named such system calls
trying to identify a convention in the chaos :-) and proposed clonex()
or clone_extended(). Peter Anvin, Oren Laadan and I prefer clone3()
which reflects the number of arguments to the new system call.

Besides that, I believe I have addressed all review comments so far.

The patchset seems to work fine on x86. An earlier version of the patchset
was ported to ppc64 and work is in progress to port this version too.
---

[v9][PATCH 0/9] Implement clone3() system call

To support application checkpoint/restart, a task must have the same pid it
had when it was checkpointed.  When containers are nested, the tasks within
the containers exist in multiple pid namespaces and hence have multiple pids
to specify during restart.

This patchset implements a new system call, clone3() that lets a process
specify the pids of the child process.

Patches 1 through 6 are helper patches needed for choosing a pid for the
child process.

PATCH 8 defines a prototype of the new system call. PATCH 9 adds some
documentation on the new system call, some/all of which will eventually
go into a man page.

Changelog[v9]:
	- [Pavel Emelyanov] Drop the patch that made 'pid_max' a property
	  of struct pid_namespace
	- [Roland McGrath, H. Peter Anvin and earlier on, Serge Hallyn] To
	  avoid inadvertent truncation clone_flags, preserve the first
	  parameter of clone3() as 'u32 clone_flags' and specify newer
	  flags in clone_args.flags_high (PATCH 8/9 and PATCH 9/9)
	- [Eric Biederman] Generalize alloc_pidmap() code to simplify and
	  remove duplication (see PATCH 3/9].
	  
Changelog[v8]:
	- [Oren Laadan, Louis Rilling, KOSAKI Motohiro]
	  The name 'clone2()' is in use - renamed new syscall to clone3().
	- [Oren Laadan] ->parent_tidptr and ->child_tidptr need to be 64bit.
	- [Oren Laadan] Ensure that unused fields/flags in clone_struct are 0.
	  (Added [PATCH 7/10] to the patchset).

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann]
	  Group the arguments to clone2() into a 'struct clone_arg' to
	  workaround the issue of exceeding 6 arguments to the system call.
	  Also define clone-flags as u64 to allow additional clone-flags.

Changelog[v6]:
	- [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds]
	  Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is
	  constant across architectures (Patches 7, 8).
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check (Patches 7,8)
	- (Pavel Machek) New patch (Patch 9) to add some documentation.

Changelog[v5]:
	- Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch
	  into this set)
	- (Eric Biederman): Avoid the new function, set_pidmap() - added
	  couple of checks on 'target_pid' in alloc_pidmap() itself.

=== IMPORTANT NOTE:

clone() system call has another limitation - all but one, available bits in
clone-flags are in use and if more new clone-flags are needed, we will need
a variant of the clone() system call. 

It appears to make sense to try and extend this new system call to address
this limitation as well. The requirements of a new clone system call could
then be summarized as:

	- do everything clone() does today, and
	- give application an ability to choose pids for the child process
	  in all ancestor pid namespaces, and
	- allow more clone_flags

Contstraints:

	- system-calls are restricted to 6 parameters and clone() already
	  takes 5 parameters, any extension to clone() interface would require
	  one or more copy_from_user().  (Not sure if copy_from_user() of ~40
	  bytes would have a significant impact on performance of clone()).

Based on these requirements and constraints, we explored a couple of system
call interfaces (in earlier versions of this patchset).  Based on input from
Arnd Bergmann and others, the new interface of the system call is: 

	struct clone_args {
		u64 clone_flags_high;
		u64 child_stack_base;
		u64 child_stack_size;
		u64 parent_tid_ptr;
		u64 child_tid_ptr;
		u32 nr_pids;
		u32 clone_args_size;
		u64 reserved1;
	};

	sys_clone3(u32 flags_low, struct clone_args *cargs, pid_t *pids)

Details of the struct clone_args and the usage are explained in the
documentation (PATCH 9/9).

NOTE:
	While this patchset enables support for more clone-flags, actual
	implementation for additional clone-flags is best implemented as
	a separate patchset (PATCH 8/9 identifies some TODOs)

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [v9][PATCH 1/9] Factor out code to allocate pidmap page
       [not found] ` <20091025033508.GA20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-25  3:37   ` Sukadev Bhattiprolu
  2009-10-25  3:37   ` [v9][PATCH 2/9] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-25  3:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	matthltc-r/Jw6+rmf7HQT0dZR+AlfA, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA


Subject: [v9][PATCH 1/9] Factor out code to allocate pidmap page

To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.

Changelog[v3]:
	- Earlier version of patchset called alloc_pidmap_page() from two
	  places. But now its called from only one place. Even so, moving
	  this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
	- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
	  -ENOMEM on error instead of -1.

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 kernel/pid.c |   45 ++++++++++++++++++++++++++++++---------------
 1 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index d3f722d..7d4bb6e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,9 +122,35 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (map->page)
+		kfree(page);
+	else
+		map->page = page;
+	spin_unlock_irq(&pidmap_lock);
+
+	if (unlikely(!map->page))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int rc;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -134,21 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		rc = alloc_pidmap_page(map);
+		if (rc)
+			break;
+
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
-- 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [v9][PATCH 2/9] Have alloc_pidmap() return actual error code
       [not found] ` <20091025033508.GA20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-10-25  3:37   ` [v9][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
@ 2009-10-25  3:37   ` Sukadev Bhattiprolu
  2009-10-25  3:38   ` [v9][PATCH 3/9] Define set_pidmap() function Sukadev Bhattiprolu
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-25  3:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	matthltc-r/Jw6+rmf7HQT0dZR+AlfA, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA


Subject: [v9][PATCH 2/9] Have alloc_pidmap() return actual error code

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed.  With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |   14 +++++++++-----
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 4c20fff..93626b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1156,10 +1156,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 
 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index 7d4bb6e..c4d9914 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -150,7 +150,7 @@ static int alloc_pidmap_page(struct pidmap *map)
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
-	int rc;
+	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
@@ -189,12 +189,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		} else {
 			map = &pid_ns->pidmap[0];
 			offset = RESERVED_PIDS;
-			if (unlikely(last == offset))
+			if (unlikely(last == offset)) {
+				rc = -EAGAIN;
 				break;
+			}
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return rc;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -263,8 +265,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	struct upid *upid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
-	if (!pid)
+	if (!pid) {
+		pid = ERR_PTR(-ENOMEM);
 		goto out;
+	}
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
@@ -299,7 +303,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }
 
-- 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [v9][PATCH 3/9] Define set_pidmap() function
       [not found] ` <20091025033508.GA20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-10-25  3:37   ` [v9][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
  2009-10-25  3:37   ` [v9][PATCH 2/9] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
@ 2009-10-25  3:38   ` Sukadev Bhattiprolu
  2009-10-25  3:38   ` [v9][PATCH 4/9] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
                     ` (5 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-25  3:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	matthltc-r/Jw6+rmf7HQT0dZR+AlfA, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA


Subject: [v9][PATCH 3/9] Define set_pidmap() function

Define a set_pidmap() interface which is like alloc_pidmap() only that
caller specifies the pid number to be assigned.

Changelog[v9]:
	- Completely rewrote this patch based on Eric Biederman's code.
Changelog[v7]:
        - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
Changelog[v6]:
        - Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
        - (Eric Biederman): Avoid set_pidmap() function. Added couple of
          checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
        - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
          actually checks for 'pid <= 0' for completeness).

Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 kernel/pid.c |   40 ++++++++++++++++++++++++++++++++--------
 1 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index c4d9914..52c92b3 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
+		int max)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, offset, max_scan, pid;
 	int rc = -EAGAIN;
 	struct pidmap *map;
 
 	pid = last + 1;
 	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+		pid = min;
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		rc = alloc_pidmap_page(map);
 		if (rc)
@@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
-			offset = RESERVED_PIDS;
+			offset = min;
 			if (unlikely(last == offset)) {
 				rc = -EAGAIN;
 				break;
@@ -199,6 +199,30 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return rc;
 }
 
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+	int nr;
+
+	nr = do_alloc_pidmap(pid_ns, pid_ns->last_pid, RESERVED_PIDS, pid_max);
+	if (nr >= 0)
+		pid_ns->last_pid = nr;
+	return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+	if (!target)
+		return alloc_pidmap(pid_ns);
+
+	if (target >= pid_max)
+		return -EINVAL;
+
+	if (target < 0 || (target < RESERVED_PIDS &&
+				pid_ns->last_pid >= RESERVED_PIDS))
+		return -EINVAL;
+	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
-- 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [v9][PATCH 4/9] Add target_pids parameter to alloc_pid()
       [not found] ` <20091025033508.GA20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2009-10-25  3:38   ` [v9][PATCH 3/9] Define set_pidmap() function Sukadev Bhattiprolu
@ 2009-10-25  3:38   ` Sukadev Bhattiprolu
  2009-10-25  3:39   ` [v9][PATCH 5/9] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
                     ` (4 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-25  3:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	matthltc-r/Jw6+rmf7HQT0dZR+AlfA, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA


Subject: [v9][PATCH 4/9] Add target_pids parameter to alloc_pid()

This parameter is currently NULL, but will be used in a follow-on patch.

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/pid.h |    2 +-
 kernel/fork.c       |    3 ++-
 kernel/pid.c        |    9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 93626b2..3f1dddf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -980,6 +980,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1156,7 +1157,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 52c92b3..314dde0 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -280,13 +280,14 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	pid_t tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid) {
@@ -296,7 +297,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = set_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;
 
-- 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [v9][PATCH 5/9] Add target_pids parameter to copy_process()
       [not found] ` <20091025033508.GA20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2009-10-25  3:38   ` [v9][PATCH 4/9] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
@ 2009-10-25  3:39   ` Sukadev Bhattiprolu
  2009-10-25  3:39   ` [v9][PATCH 6/9] Check invalid clone flags Sukadev Bhattiprolu
                     ` (3 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-25  3:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	matthltc-r/Jw6+rmf7HQT0dZR+AlfA, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA


Subject: [v9][PATCH 5/9] Add target_pids parameter to copy_process()

Add a 'target_pids' parameter to copy_process().  The new parameter will be
used in a follow-on patch when clone_with_pids() is implemented.

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 kernel/fork.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 3f1dddf..c8a06de 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -975,12 +975,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1361,7 +1361,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1384,6 +1384,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1424,7 +1425,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
-- 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [v9][PATCH 6/9] Check invalid clone flags
       [not found] ` <20091025033508.GA20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2009-10-25  3:39   ` [v9][PATCH 5/9] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
@ 2009-10-25  3:39   ` Sukadev Bhattiprolu
       [not found]     ` <20091025033937.GG20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-10-25  3:39   ` [v9][PATCH 7/9] Define do_fork_with_pids() Sukadev Bhattiprolu
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-25  3:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	matthltc-r/Jw6+rmf7HQT0dZR+AlfA, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA


Subject: [v9][PATCH 6/9] Check invalid clone flags

As pointed out by Oren Laadan, we want to ensure that unused bits in the
clone-flags remain unused and available for future. To ensure this, define
a mask of clone-flags and check the flags in the clone() system calls.

Changelog[v9]:
	- Include the unused clone-flag (CLONE_UNUSED) to VALID_CLONE_FLAGS
	  to avoid breaking any applications that may have set it. IOW, this
	  patch/check only applies to clone-flags bits 33 and higher.

Changelog[v8]:
	- New patch in set

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 include/linux/sched.h |   12 ++++++++++++
 kernel/fork.c         |    3 +++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 75e6e60..6b319a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -29,6 +29,18 @@
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
 
+#define CLONE_UNUSED        	0x00001000	/* Can be reused ? */
+
+#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
+				 CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\
+				 CLONE_VFORK  | CLONE_PARENT | CLONE_THREAD  |\
+				 CLONE_NEWNS  | CLONE_SYSVSEM | CLONE_SETTLS |\
+				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
+				 CLONE_DETACHED | CLONE_UNTRACED             |\
+				 CLONE_CHILD_SETTID | CLONE_STOPPED          |\
+				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
+				 CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index c8a06de..11f77ed 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -982,6 +982,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
 
+	if (clone_flags & ~VALID_CLONE_FLAGS)
+		return ERR_PTR(-EINVAL);
+
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
 
-- 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [v9][PATCH 7/9] Define do_fork_with_pids()
       [not found] ` <20091025033508.GA20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2009-10-25  3:39   ` [v9][PATCH 6/9] Check invalid clone flags Sukadev Bhattiprolu
@ 2009-10-25  3:39   ` Sukadev Bhattiprolu
  2009-10-25  3:40   ` [v9][PATCH 8/9] Define clone3() syscall Sukadev Bhattiprolu
  2009-10-25  3:40   ` [v9][PATCH 9/9] Document " Sukadev Bhattiprolu
  8 siblings, 0 replies; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-25  3:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	matthltc-r/Jw6+rmf7HQT0dZR+AlfA, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA


Subject: [v9][PATCH 7/9] Define do_fork_with_pids()

do_fork_with_pids() is same as do_fork(), except that it takes an
additional, 'pid_set', parameter. This parameter, currently unused,
specifies the set of target pids of the process in each of its pid
namespaces.

Changelog[v7]:
	- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
	  instead of 'struct pid_set *'.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.

Changelog[v4]:
	- Rename 'struct target_pid_set' to 'struct pid_set' since it may
	  be useful in other contexts.

Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
	- To facilitate moving architecture-inpdendent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
 include/linux/sched.h |    3 +++
 kernel/fork.c         |   17 +++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6b319a0..265efe5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2153,6 +2153,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				unsigned int, pid_t __user *);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/kernel/fork.c b/kernel/fork.c
index 11f77ed..210e841 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1377,12 +1377,14 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      unsigned int num_pids,
+	      pid_t __user *upids)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1485,6 +1487,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, 0, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [v9][PATCH 8/9] Define clone3() syscall
       [not found] ` <20091025033508.GA20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2009-10-25  3:39   ` [v9][PATCH 7/9] Define do_fork_with_pids() Sukadev Bhattiprolu
@ 2009-10-25  3:40   ` Sukadev Bhattiprolu
       [not found]     ` <20091025034023.GI20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-10-25  3:40   ` [v9][PATCH 9/9] Document " Sukadev Bhattiprolu
  8 siblings, 1 reply; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-25  3:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	matthltc-r/Jw6+rmf7HQT0dZR+AlfA, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA


Subject: [v9][PATCH 8/9] Define clone3() syscall

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

clone3(), intended for use during restart, is the same as clone(), except
that it takes a 'pids' paramter. This parameter lets caller choose
specific pid numbers for the child process, in the process's active and
ancestor pid namespaces. (Descendant pid namespaces in general don't matter
since processes don't have pids in them anyway, but see comments in
copy_target_pids() regarding CLONE_NEWPID).

Clone2() system call also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and most (all ?)
of these are in use. If a new clone flag is needed, we will be forced to
define a new variant of the clone() system call.

To prevent unprivileged processes from misusing this interface, clone3()
currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL.

See Documentation/clone3 in next patch for more details of clone3() and an
example of its usage.

NOTE:
	- System calls are restricted to 6 parameters and the number and sizes
	  of parameters needed for sys_clone3() exceed 6 integers. The new
	  prototype works around this restriction while providing some
	  flexibility if clone3() needs to be further extended in the future.
TODO:
	- We should convert clone-flags to 64-bit value in all architectures.
	  Its probably best to do that as a separate patchset since clone_flags
	  touches several functions and that patchset seems independent of this
	  new system call.

Changelog[v9]:
	- [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
	  architectures split the new clone-flags into 'low' and 'high'
	  words and pass in the 'lower' flags as the first argument.
	  This would maintain similarity of the clone3() with clone()/
	  clone2(). Also has the side-effect of the name matching the
	  number of parameters :-)
	- [Roland McGrath] Rename structure to 'clone_args' and add a
	  'child_stack_size' field for IA64.

Changelog[v8]
	- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
	  must be 64-bit.
	- clone2() is in use in IA64. Rename system call to clone3().

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
	  and group parameters into a new 'struct clone_struct' object.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check.

Changelog[v4]:
	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 arch/x86/include/asm/syscalls.h    |    2 +
 arch/x86/include/asm/unistd_32.h   |    1 +
 arch/x86/kernel/process_32.c       |   61 +++++++++++++++++++++++
 arch/x86/kernel/syscall_table_32.S |    1 +
 include/linux/types.h              |   11 ++++
 kernel/fork.c                      |   96 +++++++++++++++++++++++++++++++++++-
 6 files changed, 171 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 372b76e..bec1e89 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -55,6 +55,8 @@ struct sel_arg_struct;
 struct oldold_utsname;
 struct old_utsname;
 
+asmlinkage long sys_clone3(unsigned int flags_low,
+			struct clone_args __user *cs, pid_t *pids);
 asmlinkage long sys_mmap2(unsigned long, unsigned long, unsigned long,
 			  unsigned long, unsigned long, unsigned long);
 asmlinkage int old_mmap(struct mmap_arg_struct __user *);
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index 6fb3c20..9357885 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -342,6 +342,7 @@
 #define __NR_pwritev		334
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
+#define __NR_clone3		337
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4cf7956..4ccdd6b 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -445,6 +445,67 @@ int sys_clone(struct pt_regs *regs)
 	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
 }
 
+asmlinkage long
+sys_clone3(unsigned int flags_low, struct clone_args __user *ucs,
+		pid_t __user *pids)
+{
+	int rc;
+	struct clone_args kcs;
+	unsigned long flags;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long __user child_stack;
+	unsigned long stack_size;
+	struct pt_regs *regs;
+
+	rc = copy_from_user(&kcs, ucs, sizeof(kcs));
+	if (rc)
+		return -EFAULT;
+
+	/*
+	 * TODO: If size of clone_args is not what the kernel expects, it
+	 * 	 could be that kernel is newer and has an extended structure.
+	 * 	 When that happens, this check needs to be smarter (and we
+	 * 	 need an additional copy_from_user()). For now, assume exact
+	 * 	 match.
+	 */
+	if (kcs.clone_args_size != sizeof(kcs))
+		return -EINVAL;
+
+	/*
+	 * To avoid future compatibility issues, ensure unused fields are 0.
+	 */
+	if (kcs.reserved1 || kcs.clone_flags_high)
+		return -EINVAL;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 * 	 higher word(s) of 'flags':
+	 *
+	 * 		flags = (kcs.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tid_ptr = (int *)kcs.parent_tid_ptr;
+	child_tid_ptr =  (int *)kcs.child_tid_ptr;
+
+	stack_size = (unsigned long)kcs.child_stack_size;
+	child_stack = (unsigned long)kcs.child_stack_base + stack_size;
+
+	regs = task_pt_regs(current);
+
+	if (!child_stack)
+		child_stack = user_stack_pointer(regs);
+
+	/*
+	 * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+	 * 	 to several functions. Need to convert clone_flags to 64-bit.
+	 */
+	return do_fork_with_pids(flags, child_stack, regs, stack_size,
+				parent_tid_ptr, child_tid_ptr, kcs.nr_pids,
+				pids);
+}
+
 /*
  * sys_execve() executes a new program.
  */
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0157cd2..e3e82e9 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -336,3 +336,4 @@ ENTRY(sys_call_table)
 	.long sys_pwritev
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
+	.long sys_clone3
diff --git a/include/linux/types.h b/include/linux/types.h
index c42724f..626c60b 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -204,6 +204,17 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack_base;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 clone_args_size;
+	u64 reserved1;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 210e841..162cd37 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1372,6 +1372,86 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
+{
+	int j;
+	int rc;
+	int size;
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+
+	if (!unum_pids)
+		return NULL;
+
+	knum_pids = task_pid(current)->level + 1;
+	if (unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces are saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
+	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
+	 * 	 the order is:
+	 *
+	 * 		- pids for 'unknown-namespaces' (if any)
+	 * 		- pids for 'known-namespaces' (requested pids)
+	 * 		- 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], upids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1389,7 +1469,7 @@ long do_fork_with_pids(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1423,6 +1503,16 @@ long do_fork_with_pids(unsigned long clone_flags,
 		}
 	}
 
+	target_pids = copy_target_pids(num_pids, upids);
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1484,6 +1574,10 @@ long do_fork_with_pids(unsigned long clone_flags,
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }
 
-- 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [v9][PATCH 9/9] Document clone3() syscall
       [not found] ` <20091025033508.GA20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2009-10-25  3:40   ` [v9][PATCH 8/9] Define clone3() syscall Sukadev Bhattiprolu
@ 2009-10-25  3:40   ` Sukadev Bhattiprolu
       [not found]     ` <20091025034050.GJ20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  8 siblings, 1 reply; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2009-10-25  3:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oren Laadan, serue-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Alexey Dobriyan, Pavel Emelyanov,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mikew-hpIqsD4AKlfQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Nathan Lynch,
	matthltc-r/Jw6+rmf7HQT0dZR+AlfA, arnd-r2nGTMty4D4,
	peterz-wEGCiKHe2LqWVfeAwA7xHQ,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	roland-H+wXaHxf7aLQT0dZR+AlfA,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A,
	randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	sukadev-r/Jw6+rmf7HQT0dZR+AlfA


Subject: [v9][PATCH 9/9] Document clone3() syscall

This gives a brief overview of the clone3() system call.  We should
eventually describe more details in existing clone(2) man page or in
a new man page.

Changelog[v9]:
	- [Pavel Machek]: Fix an inconsistency and rename new file to
	  Documentation/clone3.
	- [Roland McGrath, H. Peter Anvin] Updates to description and
	  example to reflect new prototype of clone3() and the updated/
	  renamed 'struct clone_args'.

Changelog[v8]:
	- clone2() is already in use in IA64. Rename syscall to clone3()
	- Add notes to say that we return -EINVAL if invalid clone flags
	  are specified or if the reserved fields are not 0.
Changelog[v7]:
	- Rename clone_with_pids() to clone2()
	- Changes to reflect new prototype of clone2() (using clone_struct).

Signed-off-by: Sukadev Bhattiprolu <sukadev-8jLBTbqmX/OZamtmwQBW5tBPR1lH4CV8@public.gmane.org>
---
 Documentation/clone3 |  191 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 191 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/clone3

diff --git a/Documentation/clone3 b/Documentation/clone3
new file mode 100644
index 0000000..466fac2
--- /dev/null
+++ b/Documentation/clone3
@@ -0,0 +1,191 @@
+
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack_base;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 clone_args_size;
+	u64 reserved1;
+};
+
+
+clone3(u32 flags_low, struct clone_args * __user cargs, pid_t * __user pids)
+
+	In addition to doing everything that clone() system call does,
+	the clone3() system call:
+
+		- allows additional clone flags (31 of 32 bits in the flags
+		  parameter to clone() are in use)
+
+		- allows user to specify a pid for the child process in its
+		  active and ancestor pid name spaces.
+
+	This system call is meant to be used when restarting an application
+	from a checkpoint.  Such restart requires that the processes in the
+	application have the same pids they had when the application was
+	checkpointed. When containers are nested, the processes within the
+	containers exist in multiple pid namespaces and hence have multiple
+	pids to specify during restart.
+
+	The @flags_low parameter is identical to the 'clone_flags' parameter
+	in existing clone() system call.
+
+	The fields in 'struct clone_args' are meant to be used as follows:
+
+	u64 clone_flags_high:
+
+		When clone3() supports more than 32 clone flags, the higher
+		bits in the clone_flags should be specified in this field.
+		This field is currently unused and must be set to 0.
+
+	u64 child_stack_base;
+	u64 child_stack_size;
+
+		These two fields correspond to the 'child_stack' fields
+		in clone() and clone2() system calls (on IA64).
+
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+
+		These two fields correspond to the 'parent_tid_ptr' and
+		'child_tid_ptr' fields in the clone() system call
+
+	u32 nr_pids;
+
+		nr_pids specifies the number of pids in the @pids array
+		parameter to clone3() (see below). nr_pids should not exceed
+		the current nesting level of the calling process (i.e if the
+		process is in init_pid_ns, nr_pids must be 1, if process is
+		in a pid namespace that is a child of init-pid-ns, nr_pids
+		cannot exceed 2, and so on).
+
+	u32 clone_args_size;
+
+		clone_args_size specifes the sizeof(struct clone_args) and is
+		intended to enable extending this structure in the future,
+		while preserving backward compatibility.  For now, this field
+		must be set to the sizeof(struct clone_args) and this size must
+		match the kernel's view of the structure.
+
+	u64 reserved1;
+
+		reserved1 is intended to enable extending the functionality
+		of the clone3() system call in the future, while preserving
+		backward compatibility. It must currently be set to 0.
+
+
+	The @pids parameter defines the set of pids that should be assigned to
+	the child process in its active and ancestor pid name spaces. The
+	descendant pid namespaces do not matter since a process does not have a
+	pid in descendant namespaces, unless the process is in a new pid
+	namespace in which case the process is a container-init (and must have
+	the pid 1 in that namespace).
+
+	See CLONE_NEWPID section of clone(2) man page for details about pid
+	namespaces.
+
+	The order pids in @pids corresponds to the nesting order of pid-
+	namespaces, with @pids[0] corresponding to the init_pid_ns.
+
+	If a pid in the @pids list is 0, the kernel will assign the next
+	available pid in the pid namespace, for the process.
+
+	If a pid in the @pids list is non-zero, the kernel tries to assign
+	the specified pid in that namespace.  If that pid is already in use
+	by another process, the system call fails (see EBUSY below).
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, clone3() returns -1 and sets 'errno' to one of following
+	values (the child process is not created).
+
+	EPERM	Caller does not have the SYS_ADMIN privilege needed to excute
+		this call.
+
+	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
+		the current nesting level of parent process
+
+	EINVAL	Not all specified clone-flags are valid.
+
+	EINVAL	The reserved fields in the clone_args argument are not 0.
+
+	EBUSY	A requested pid is in use by another process in that name space.
+
+---
+/* Example usage of clone3() on i386 */
+
+#include <stdio.h>
+#include <signal.h>
+#include <errno.h>
+
+#define __NR_clone3	337
+#define TEST_PID	399
+#define STACKSIZE	8192
+
+typedef unsigned long long u64;
+typedef unsigned int u32;
+typedef int pid_t;
+
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack_base;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 clone_args_size;
+	u64 reserved1;
+};
+
+int do_child(void *arg)
+{
+	printf("Child, pid %d, arg %s\n", getpid(), arg);
+
+	if (getpid() != TEST_PID)
+		printf("Expected pid %d, actual %d\n", TEST_PID, getpid());
+
+	_Exit(0);
+}
+
+main()
+{
+	int rc;
+	void **stack;
+	struct clone_args cargs;
+
+	u32 flags_low 	= SIGCHLD;
+	char *arg_str 	= "Args for child: abcdefg";
+	pid_t pids[] 	= { 377, TEST_PID };
+
+	stack = (void **)(malloc(STACKSIZE) + STACKSIZE - 1);
+
+	/* Set up stack for child */
+	*--stack = arg_str;
+	*--stack = NULL;
+	*--stack = do_child;
+
+	cargs.clone_flags_high = (u64)0;
+	cargs.child_stack_base = (u64)stack;
+	cargs.child_stack_size = (u64)0;
+
+	cargs.nr_pids = 2;              /* assumes we are in a child pid ns */
+	cargs.parent_tid_ptr = (u64)0;
+	cargs.child_tid_ptr = (u64)0;
+
+	cargs.clone_args_size = sizeof(cargs);
+	cargs.reserved1 = (u64)0;
+
+	rc = syscall(__NR_clone3, flags_low, &cargs, &pids);
+
+	if (rc != TEST_PID) {
+		printf("Parent: expected rc %d, actual %d, errno %d\n",
+				 TEST_PID, rc, errno);
+	} else {
+		printf("Parent: clone3() returns %d, errno %d\n", rc, errno);
+	}
+
+	waitpid(-1, NULL, 0);
+}
-- 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [v9][PATCH 6/9] Check invalid clone flags
       [not found]     ` <20091025033937.GG20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-25 17:08       ` Oren Laadan
  0 siblings, 0 replies; 14+ messages in thread
From: Oren Laadan @ 2009-10-25 17:08 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Andrew Morton, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Pavel Emelyanov,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Alexey Dobriyan, Eric W. Biederman, arnd-r2nGTMty4D4,
	Nathan Lynch, roland-H+wXaHxf7aLQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b



Sukadev Bhattiprolu wrote:
> Subject: [v9][PATCH 6/9] Check invalid clone flags
> 
> As pointed out by Oren Laadan, we want to ensure that unused bits in the
> clone-flags remain unused and available for future. To ensure this, define
> a mask of clone-flags and check the flags in the clone() system calls.
> 
> Changelog[v9]:
> 	- Include the unused clone-flag (CLONE_UNUSED) to VALID_CLONE_FLAGS
> 	  to avoid breaking any applications that may have set it. IOW, this
> 	  patch/check only applies to clone-flags bits 33 and higher.
> 
> Changelog[v8]:
> 	- New patch in set
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>

Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

> ---
>  include/linux/sched.h |   12 ++++++++++++
>  kernel/fork.c         |    3 +++
>  2 files changed, 15 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 75e6e60..6b319a0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -29,6 +29,18 @@
>  #define CLONE_NEWNET		0x40000000	/* New network namespace */
>  #define CLONE_IO		0x80000000	/* Clone io context */
>  
> +#define CLONE_UNUSED        	0x00001000	/* Can be reused ? */
> +
> +#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
> +				 CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\
> +				 CLONE_VFORK  | CLONE_PARENT | CLONE_THREAD  |\
> +				 CLONE_NEWNS  | CLONE_SYSVSEM | CLONE_SETTLS |\
> +				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
> +				 CLONE_DETACHED | CLONE_UNTRACED             |\
> +				 CLONE_CHILD_SETTID | CLONE_STOPPED          |\
> +				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
> +				 CLONE_NEWPID | CLONE_NEWNET| CLONE_IO)
> +
>  /*
>   * Scheduling policies
>   */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index c8a06de..11f77ed 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -982,6 +982,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	struct task_struct *p;
>  	int cgroup_callbacks_done = 0;
>  
> +	if (clone_flags & ~VALID_CLONE_FLAGS)
> +		return ERR_PTR(-EINVAL);
> +
>  	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
>  		return ERR_PTR(-EINVAL);
>  

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [v9][PATCH 8/9] Define clone3() syscall
       [not found]     ` <20091025034023.GI20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-25 17:17       ` Linus Torvalds
  2009-10-25 17:23       ` Oren Laadan
  1 sibling, 0 replies; 14+ messages in thread
From: Linus Torvalds @ 2009-10-25 17:17 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Pavel Emelyanov,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Alexey Dobriyan, Eric W. Biederman, arnd-r2nGTMty4D4,
	Nathan Lynch, roland-H+wXaHxf7aLQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers, Andrew Morton



On Sat, 24 Oct 2009, Sukadev Bhattiprolu wrote:
> 
> Subject: [v9][PATCH 8/9] Define clone3() syscall

Ok, guys, can you please un-cc me from this endless discussion. Two 
reasons:

 - The name "clone3" sucks. And no, "wait3" and "wait4" are not good 
   reasons to call something "clone3"

but more importantly:

 - I told people _months_ ago that you can't do this with an architecture- 
   neutral wrapper system call and not that fake-portable task_pt_regs() 
   thing that IS NOT APPROPRIATE in this context. People apparently didn't 
   listen. So why continue to cc me on this interminable thread, when it's 
   apparently just a big masturbatory email session about naming?

The patches will not be merged. You don't listen to any technical 
feedback, and instead waste everybodys time with pointless bikeshed 
painting discussions. Please don't cc me any more.

		Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [v9][PATCH 9/9] Document clone3() syscall
       [not found]     ` <20091025034050.GJ20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-25 17:21       ` Oren Laadan
  0 siblings, 0 replies; 14+ messages in thread
From: Oren Laadan @ 2009-10-25 17:21 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Andrew Morton, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Pavel Emelyanov,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Alexey Dobriyan, Eric W. Biederman, arnd-r2nGTMty4D4,
	Nathan Lynch, roland-H+wXaHxf7aLQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b



Sukadev Bhattiprolu wrote:
> Subject: [v9][PATCH 9/9] Document clone3() syscall
> 
> This gives a brief overview of the clone3() system call.  We should
> eventually describe more details in existing clone(2) man page or in
> a new man page.
> 
> Changelog[v9]:
> 	- [Pavel Machek]: Fix an inconsistency and rename new file to
> 	  Documentation/clone3.
> 	- [Roland McGrath, H. Peter Anvin] Updates to description and
> 	  example to reflect new prototype of clone3() and the updated/
> 	  renamed 'struct clone_args'.
> 
> Changelog[v8]:
> 	- clone2() is already in use in IA64. Rename syscall to clone3()
> 	- Add notes to say that we return -EINVAL if invalid clone flags
> 	  are specified or if the reserved fields are not 0.
> Changelog[v7]:
> 	- Rename clone_with_pids() to clone2()
> 	- Changes to reflect new prototype of clone2() (using clone_struct).
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev-8jLBTbqmX/OZamtmwQBW5tBPR1lH4CV8@public.gmane.org>

A couple of nits below; otherwise:

Acked-by: Oren Laadan  <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

> ---
>  Documentation/clone3 |  191 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 191 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/clone3
> 
> diff --git a/Documentation/clone3 b/Documentation/clone3
> new file mode 100644
> index 0000000..466fac2
> --- /dev/null
> +++ b/Documentation/clone3
> @@ -0,0 +1,191 @@
> +
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack_base;
> +	u64 child_stack_size;
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +	u32 nr_pids;
> +	u32 clone_args_size;
> +	u64 reserved1;
> +};
> +
> +
> +clone3(u32 flags_low, struct clone_args * __user cargs, pid_t * __user pids)
> +
> +	In addition to doing everything that clone() system call does,
> +	the clone3() system call:
> +
> +		- allows additional clone flags (31 of 32 bits in the flags
> +		  parameter to clone() are in use)
> +
> +		- allows user to specify a pid for the child process in its
> +		  active and ancestor pid name spaces.
> +
> +	This system call is meant to be used when restarting an application
> +	from a checkpoint.  Such restart requires that the processes in the
> +	application have the same pids they had when the application was
> +	checkpointed. When containers are nested, the processes within the
> +	containers exist in multiple pid namespaces and hence have multiple
> +	pids to specify during restart.
> +
> +	The @flags_low parameter is identical to the 'clone_flags' parameter
> +	in existing clone() system call.
> +
> +	The fields in 'struct clone_args' are meant to be used as follows:
> +
> +	u64 clone_flags_high:
> +
> +		When clone3() supports more than 32 clone flags, the higher
								     ^^^^^^
s/higher/additional/ ?

> +		bits in the clone_flags should be specified in this field.
> +		This field is currently unused and must be set to 0.
> +
> +	u64 child_stack_base;
> +	u64 child_stack_size;
> +
> +		These two fields correspond to the 'child_stack' fields
> +		in clone() and clone2() system calls (on IA64).
> +
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +
> +		These two fields correspond to the 'parent_tid_ptr' and
> +		'child_tid_ptr' fields in the clone() system call
> +
> +	u32 nr_pids;
> +
> +		nr_pids specifies the number of pids in the @pids array
> +		parameter to clone3() (see below). nr_pids should not exceed
> +		the current nesting level of the calling process (i.e if the
> +		process is in init_pid_ns, nr_pids must be 1, if process is
> +		in a pid namespace that is a child of init-pid-ns, nr_pids
> +		cannot exceed 2, and so on).
> +
> +	u32 clone_args_size;
> +
> +		clone_args_size specifes the sizeof(struct clone_args) and is
> +		intended to enable extending this structure in the future,
> +		while preserving backward compatibility.  For now, this field
> +		must be set to the sizeof(struct clone_args) and this size must
> +		match the kernel's view of the structure.
> +
> +	u64 reserved1;
> +
> +		reserved1 is intended to enable extending the functionality
> +		of the clone3() system call in the future, while preserving
> +		backward compatibility. It must currently be set to 0.
> +
> +
> +	The @pids parameter defines the set of pids that should be assigned to
> +	the child process in its active and ancestor pid name spaces. The
						    ^^^^^^^^^^
s/name spaces/namespaces/

> +	descendant pid namespaces do not matter since a process does not have a
> +	pid in descendant namespaces, unless the process is in a new pid
> +	namespace in which case the process is a container-init (and must have
> +	the pid 1 in that namespace).
> +
> +	See CLONE_NEWPID section of clone(2) man page for details about pid
> +	namespaces.
> +
> +	The order pids in @pids corresponds to the nesting order of pid-
	       ^^^^^
s/order/order of/

> +	namespaces, with @pids[0] corresponding to the init_pid_ns.
			 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is only true when the caller provides that many pids in the array.
If the caller provides 3 pids at a nesting level 6, then @pids[0]
corresponds to level 4 pid-ns.

> +
> +	If a pid in the @pids list is 0, the kernel will assign the next
> +	available pid in the pid namespace, for the process.
> +
> +	If a pid in the @pids list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails (see EBUSY below).
> +
> +	On success, the system call returns the pid of the child process in
> +	the parent's active pid namespace.
> +
> +	On failure, clone3() returns -1 and sets 'errno' to one of following
> +	values (the child process is not created).
> +
> +	EPERM	Caller does not have the SYS_ADMIN privilege needed to excute
					^^^^^^^^^^^		   ^^^^^^^^^^
s/SYS_ADMIN/CAP_SYS_ADMIN
s/execute this call/to specify pids in this call./

> +		this call.

> +
> +	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
> +		the current nesting level of parent process
> +
> +	EINVAL	Not all specified clone-flags are valid.
> +
> +	EINVAL	The reserved fields in the clone_args argument are not 0.
> +
> +	EBUSY	A requested pid is in use by another process in that name space.
> +
> +---

[...]

Oren.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [v9][PATCH 8/9] Define clone3() syscall
       [not found]     ` <20091025034023.GI20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-10-25 17:17       ` Linus Torvalds
@ 2009-10-25 17:23       ` Oren Laadan
  1 sibling, 0 replies; 14+ messages in thread
From: Oren Laadan @ 2009-10-25 17:23 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Andrew Morton, randy.dunlap-QHcLZuEGTsvQT0dZR+AlfA,
	mtk.manpages-gM/Ye1E23mwN+BqQ9rBEUg, pavel-+ZI9xUNit7I,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, Pavel Emelyanov,
	Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ,
	kosaki.motohiro-+CUm20s59erQFUHtdCDX3A, mingo-X9Un+BFzKDI,
	Alexey Dobriyan, Eric W. Biederman, arnd-r2nGTMty4D4,
	Nathan Lynch, roland-H+wXaHxf7aLQT0dZR+AlfA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Containers,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b



Sukadev Bhattiprolu wrote:
> Subject: [v9][PATCH 8/9] Define clone3() syscall
> 
> Container restart requires that a task have the same pid it had when it was
> checkpointed. When containers are nested the tasks within the containers
> exist in multiple pid namespaces and hence have multiple pids to specify
> during restart.
> 
> clone3(), intended for use during restart, is the same as clone(), except
> that it takes a 'pids' paramter. This parameter lets caller choose
> specific pid numbers for the child process, in the process's active and
> ancestor pid namespaces. (Descendant pid namespaces in general don't matter
> since processes don't have pids in them anyway, but see comments in
> copy_target_pids() regarding CLONE_NEWPID).
> 
> Clone2() system call also attempts to address a second limitation of the
^^^^^^^^^^
s/Clone2/Clone3/

Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2009-10-25 17:23 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-10-25  3:35 [v9][PATCH 0/9] Implement clone3() system call Sukadev Bhattiprolu
     [not found] ` <20091025033508.GA20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-25  3:37   ` [v9][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
2009-10-25  3:37   ` [v9][PATCH 2/9] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
2009-10-25  3:38   ` [v9][PATCH 3/9] Define set_pidmap() function Sukadev Bhattiprolu
2009-10-25  3:38   ` [v9][PATCH 4/9] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
2009-10-25  3:39   ` [v9][PATCH 5/9] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
2009-10-25  3:39   ` [v9][PATCH 6/9] Check invalid clone flags Sukadev Bhattiprolu
     [not found]     ` <20091025033937.GG20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-25 17:08       ` Oren Laadan
2009-10-25  3:39   ` [v9][PATCH 7/9] Define do_fork_with_pids() Sukadev Bhattiprolu
2009-10-25  3:40   ` [v9][PATCH 8/9] Define clone3() syscall Sukadev Bhattiprolu
     [not found]     ` <20091025034023.GI20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-25 17:17       ` Linus Torvalds
2009-10-25 17:23       ` Oren Laadan
2009-10-25  3:40   ` [v9][PATCH 9/9] Document " Sukadev Bhattiprolu
     [not found]     ` <20091025034050.GJ20327-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-25 17:21       ` Oren Laadan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).