public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [v12][PATCH 0/9] Implement eclone() syscall
       [not found] <20091111043440.GA9377@suka>
@ 2009-11-11 22:38 ` Sukadev Bhattiprolu
       [not found] ` <20091111044250.GA11393@suka>
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
	Pavel Emelyanov, hpa, Nathan Lynch, matthltc, arnd, roland,
	mtk.manpages, linux-api, Containers, linux-kernel

Cc: LKML

Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
| 
| Based on discussions on containers mailing list and IRC, we settled on
| the name eclone(). Please let me know of a better name or if there are
| other comments on the patchset.
| 
| ---
| 
| Subject: [v12][PATCH 0/9] Implement eclone() syscall
| 
| To support application checkpoint/restart, a task must have the same pid it
| had when it was checkpointed.  When containers are nested, the tasks within
| the containers exist in multiple pid namespaces and hence have multiple pids
| to specify during restart.
| 
| This patchset implements a new system call, eclone() that lets a process
| specify the pids of the child process.
| 
| Patches 1 through 6 are helper patches needed for choosing a pid for the
| child process.
| 
| PATCH 8 implements the eclone() system call on x86. The interface defined in
| PATCH 8 has been ported to s390 and ppc64 architectures, but they will be
| posted as a separate patchset if this patchset is accepted.
| 
| PATCH 9 adds some documentation on the new system call, some/all of which
| will eventually go into a man page.
| 
| Changelog[v12]:
| 	- Ignore ->child_stack_size when ->child_stack_base is NULL (PATCH 8)
| 	- Cleanup/simplify example in Documentation/eclone (PATCH 9).
| 	- Rename sys call to a shorter name, eclone()
| 
| Changelog[v11]:
| 	- [Dave Hansen] Move clone_args validation checks to arch-indpeendent
| 	  code.
| 	- [Oren Laadan] Make args_size a parameter to system call and remove
| 	  it from 'struct clone_args'
| 
| Changelog[v10]:
| 	- [Linus Torvalds] Use PTREGSCALL() implementation for clone rather
| 	  than the generic system call
| 	- Rename clone3() to clone_with_pids()
| 	- Update Documentation/clone_with_pids() to show example usage with
| 	  the PTREGSCALL implementation.
| 
| Changelog[v9]:
| 	- [Pavel Emelyanov] Drop the patch that made 'pid_max' a property
| 	  of struct pid_namespace
| 	- [Roland McGrath, H. Peter Anvin and earlier on, Serge Hallyn] To
| 	  avoid inadvertent truncation clone_flags, preserve the first
| 	  parameter of clone3() as 'u32 clone_flags' and specify newer
| 	  flags in clone_args.flags_high (PATCH 8/9 and PATCH 9/9)
| 	- [Eric Biederman] Generalize alloc_pidmap() code to simplify and
| 	  remove duplication (see PATCH 3/9].
| 	  
| Changelog[v8]:
| 	- [Oren Laadan, Louis Rilling, KOSAKI Motohiro]
| 	  The name 'clone2()' is in use - renamed new syscall to clone3().
| 	- [Oren Laadan] ->parent_tidptr and ->child_tidptr need to be 64bit.
| 	- [Oren Laadan] Ensure that unused fields/flags in clone_struct are 0.
| 	  (Added [PATCH 7/10] to the patchset).
| 
| Changelog[v7]:
| 	- [Peter Zijlstra, Arnd Bergmann]
| 	  Group the arguments to clone2() into a 'struct clone_arg' to
| 	  workaround the issue of exceeding 6 arguments to the system call.
| 	  Also define clone-flags as u64 to allow additional clone-flags.
| 
| Changelog[v6]:
| 	- [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds]
| 	  Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is
| 	  constant across architectures (Patches 7, 8).
| 	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
| 	  'unum_pids < 0' check (Patches 7,8)
| 	- (Pavel Machek) New patch (Patch 9) to add some documentation.
| 
| Changelog[v5]:
| 	- Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch
| 	  into this set)
| 	- (Eric Biederman): Avoid the new function, set_pidmap() - added
| 	  couple of checks on 'target_pid' in alloc_pidmap() itself.
| 
| === IMPORTANT NOTE:
| 
| clone() system call has another limitation - all but one bits in clone-flags
| are in use and if more new clone-flags are needed, we will need a variant of
| the clone() system call. 
| 
| It appears to make sense to try and extend this new system call to address
| this limitation as well. The requirements of a new clone system call could
| then be summarized as:
| 
| 	- do everything clone() does today, and
| 	- give application an ability to choose pids for the child process
| 	  in all ancestor pid namespaces, and
| 	- allow more clone_flags
| 
| Contstraints:
| 
| 	- system-calls are restricted to 6 parameters and clone() already
| 	  takes 5 parameters, any extension to clone() interface would require
| 	  one or more copy_from_user().  (Not sure if copy_from_user() of ~40
| 	  bytes would have a significant impact on performance of clone()).
| 
| Based on these requirements and constraints, we explored a couple of system
| call interfaces (in earlier versions of this patchset). Based on input from
| Arnd Bergmann and others, the new interface of the system call is: 
| 
| 	struct clone_args {
| 		u64 clone_flags_high;
| 		u64 child_stack_base;
| 		u64 child_stack_size;
| 		u64 parent_tid_ptr;
| 		u64 child_tid_ptr;
| 		u32 nr_pids;
| 		u32 reserved0;
| 		u64 reserved1;
| 	};
| 
| 	sys_eclone(u32 flags_low, struct clone_args *cargs, int args_size,
| 			pid_t *pids)
| 
| Details of the struct clone_args and the usage are explained in the
| documentation (PATCH 9/9).
| 
| NOTE:
| 	While this patchset enables support for more clone-flags, actual
| 	implementation for additional clone-flags is best implemented as
| 	a separate patchset (PATCH 8/9 identifies some TODOs)
| 
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [v12][PATCH 1/9] Factor out code to allocate pidmap page
       [not found] ` <20091111044250.GA11393@suka>
@ 2009-11-11 22:38   ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
	Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
	Pavel Emelyanov, linux-kernel

Cc LKML

Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
| 
| Subject: [v12][PATCH 1/9] Factor out code to allocate pidmap page
| 
| To simplify alloc_pidmap(), move code to allocate a pid map page to a
| separate function.
| 
| Changelog[v3]:
| 	- Earlier version of patchset called alloc_pidmap_page() from two
| 	  places. But now its called from only one place. Even so, moving
| 	  this code out into a separate function simplifies alloc_pidmap().
| Changelog[v2]:
| 	- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
| 	  -ENOMEM on error instead of -1.
| 
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Serge Hallyn <serue@us.ibm.com>
| Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
|  kernel/pid.c |   45 ++++++++++++++++++++++++++++++---------------
|  1 files changed, 30 insertions(+), 15 deletions(-)
| 
| diff --git a/kernel/pid.c b/kernel/pid.c
| index d3f722d..7d4bb6e 100644
| --- a/kernel/pid.c
| +++ b/kernel/pid.c
| @@ -122,9 +122,35 @@ static void free_pidmap(struct upid *upid)
|  	atomic_inc(&map->nr_free);
|  }
| 
| +static int alloc_pidmap_page(struct pidmap *map)
| +{
| +	void *page;
| +
| +	if (likely(map->page))
| +		return 0;
| +
| +	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
| +
| +	/*
| +	 * Free the page if someone raced with us installing it:
| +	 */
| +	spin_lock_irq(&pidmap_lock);
| +	if (map->page)
| +		kfree(page);
| +	else
| +		map->page = page;
| +	spin_unlock_irq(&pidmap_lock);
| +
| +	if (unlikely(!map->page))
| +		return -ENOMEM;
| +
| +	return 0;
| +}
| +
|  static int alloc_pidmap(struct pid_namespace *pid_ns)
|  {
|  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
| +	int rc;
|  	struct pidmap *map;
| 
|  	pid = last + 1;
| @@ -134,21 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
|  	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
|  	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
|  	for (i = 0; i <= max_scan; ++i) {
| -		if (unlikely(!map->page)) {
| -			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
| -			/*
| -			 * Free the page if someone raced with us
| -			 * installing it:
| -			 */
| -			spin_lock_irq(&pidmap_lock);
| -			if (map->page)
| -				kfree(page);
| -			else
| -				map->page = page;
| -			spin_unlock_irq(&pidmap_lock);
| -			if (unlikely(!map->page))
| -				break;
| -		}
| +		rc = alloc_pidmap_page(map);
| +		if (rc)
| +			break;
| +
|  		if (likely(atomic_read(&map->nr_free))) {
|  			do {
|  				if (!test_and_set_bit(offset, map->page)) {
| -- 
| 1.6.0.4
| 
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [v12][PATCH 2/9] Have alloc_pidmap() return actual error code
       [not found] ` <20091111044313.GB11393@suka>
@ 2009-11-11 22:39   ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
	Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
	Pavel Emelyanov, linux-kernel

CC: LKML

Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
| 
| Subject: [v12][PATCH 2/9] Have alloc_pidmap() return actual error code
| 
| alloc_pidmap() can fail either because all pid numbers are in use or
| because memory allocation failed.  With support for setting a specific
| pid number, alloc_pidmap() would also fail if either the given pid
| number is invalid or in use.
| 
| Rather than have callers assume -ENOMEM, have alloc_pidmap() return
| the actual error.
| 
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Serge Hallyn <serue@us.ibm.com>
| Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
|  kernel/fork.c |    5 +++--
|  kernel/pid.c  |   14 +++++++++-----
|  2 files changed, 12 insertions(+), 7 deletions(-)
| 
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 4c20fff..93626b2 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -1156,10 +1156,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
|  		goto bad_fork_cleanup_io;
| 
|  	if (pid != &init_struct_pid) {
| -		retval = -ENOMEM;
|  		pid = alloc_pid(p->nsproxy->pid_ns);
| -		if (!pid)
| +		if (IS_ERR(pid)) {
| +			retval = PTR_ERR(pid);
|  			goto bad_fork_cleanup_io;
| +		}
| 
|  		if (clone_flags & CLONE_NEWPID) {
|  			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
| diff --git a/kernel/pid.c b/kernel/pid.c
| index 7d4bb6e..c4d9914 100644
| --- a/kernel/pid.c
| +++ b/kernel/pid.c
| @@ -150,7 +150,7 @@ static int alloc_pidmap_page(struct pidmap *map)
|  static int alloc_pidmap(struct pid_namespace *pid_ns)
|  {
|  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
| -	int rc;
| +	int rc = -EAGAIN;
|  	struct pidmap *map;
| 
|  	pid = last + 1;
| @@ -189,12 +189,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
|  		} else {
|  			map = &pid_ns->pidmap[0];
|  			offset = RESERVED_PIDS;
| -			if (unlikely(last == offset))
| +			if (unlikely(last == offset)) {
| +				rc = -EAGAIN;
|  				break;
| +			}
|  		}
|  		pid = mk_pid(pid_ns, map, offset);
|  	}
| -	return -1;
| +	return rc;
|  }
| 
|  int next_pidmap(struct pid_namespace *pid_ns, int last)
| @@ -263,8 +265,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
|  	struct upid *upid;
| 
|  	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
| -	if (!pid)
| +	if (!pid) {
| +		pid = ERR_PTR(-ENOMEM);
|  		goto out;
| +	}
| 
|  	tmp = ns;
|  	for (i = ns->level; i >= 0; i--) {
| @@ -299,7 +303,7 @@ out_free:
|  		free_pidmap(pid->numbers + i);
| 
|  	kmem_cache_free(ns->pid_cachep, pid);
| -	pid = NULL;
| +	pid = ERR_PTR(nr);
|  	goto out;
|  }
| 
| -- 
| 1.6.0.4
| 
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [v12][PATCH 3/9] Define set_pidmap() function
       [not found] ` <20091111044329.GC11393@suka>
@ 2009-11-11 22:39   ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
	Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
	Pavel Emelyanov, linux-kernel

Cc: LKML

Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
| 
| Subject: [v12][PATCH 3/9] Define set_pidmap() function
| 
| Define a set_pidmap() interface which is like alloc_pidmap() only that
| caller specifies the pid number to be assigned.
| 
| Changelog[v9]:
| 	- Completely rewrote this patch based on Eric Biederman's code.
| Changelog[v7]:
|         - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
| Changelog[v6]:
|         - Separate target_pid > 0 case to minimize the number of checks needed.
| Changelog[v3]:
|         - (Eric Biederman): Avoid set_pidmap() function. Added couple of
|           checks for target_pid in alloc_pidmap() itself.
| Changelog[v2]:
|         - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
|           actually checks for 'pid <= 0' for completeness).
| 
| Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
| Acked-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
|  kernel/pid.c |   41 +++++++++++++++++++++++++++++++++--------
|  1 files changed, 33 insertions(+), 8 deletions(-)
| 
| diff --git a/kernel/pid.c b/kernel/pid.c
| index c4d9914..afef766 100644
| --- a/kernel/pid.c
| +++ b/kernel/pid.c
| @@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
|  	return 0;
|  }
| 
| -static int alloc_pidmap(struct pid_namespace *pid_ns)
| +static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
| +		int max)
|  {
| -	int i, offset, max_scan, pid, last = pid_ns->last_pid;
| +	int i, offset, max_scan, pid;
|  	int rc = -EAGAIN;
|  	struct pidmap *map;
| 
|  	pid = last + 1;
|  	if (pid >= pid_max)
| -		pid = RESERVED_PIDS;
| +		pid = min;
|  	offset = pid & BITS_PER_PAGE_MASK;
|  	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
| -	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
| +	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
|  	for (i = 0; i <= max_scan; ++i) {
|  		rc = alloc_pidmap_page(map);
|  		if (rc)
| @@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
|  			do {
|  				if (!test_and_set_bit(offset, map->page)) {
|  					atomic_dec(&map->nr_free);
| -					pid_ns->last_pid = pid;
|  					return pid;
|  				}
|  				offset = find_next_offset(map, offset);
| @@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
|  			 * bitmap block and the final block was the same
|  			 * as the starting point, pid is before last_pid.
|  			 */
| -			} while (offset < BITS_PER_PAGE && pid < pid_max &&
| +			} while (offset < BITS_PER_PAGE && pid < max &&
|  					(i != max_scan || pid < last ||
|  					    !((last+1) & BITS_PER_PAGE_MASK)));
|  		}
| -		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
| +		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
|  			++map;
|  			offset = 0;
|  		} else {
|  			map = &pid_ns->pidmap[0];
| -			offset = RESERVED_PIDS;
| +			offset = min;
|  			if (unlikely(last == offset)) {
|  				rc = -EAGAIN;
|  				break;
| @@ -199,6 +199,31 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
|  	return rc;
|  }
| 
| +static int alloc_pidmap(struct pid_namespace *pid_ns)
| +{
| +	int nr;
| +
| +	nr = do_alloc_pidmap(pid_ns, pid_ns->last_pid, RESERVED_PIDS, pid_max);
| +	if (nr >= 0)
| +		pid_ns->last_pid = nr;
| +	return nr;
| +}
| +
| +static int set_pidmap(struct pid_namespace *pid_ns, int target)
| +{
| +	if (!target)
| +		return alloc_pidmap(pid_ns);
| +
| +	if (target >= pid_max)
| +		return -EINVAL;
| +
| +	if ((target < 0) || (target < RESERVED_PIDS &&
| +				pid_ns->last_pid >= RESERVED_PIDS))
| +		return -EINVAL;
| +
| +	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
| +}
| +
|  int next_pidmap(struct pid_namespace *pid_ns, int last)
|  {
|  	int offset;
| -- 
| 1.6.0.4
| 
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [v12][PATCH 4/9] Add target_pids parameter to alloc_pid()
       [not found] ` <20091111044347.GD11393@suka>
@ 2009-11-11 22:39   ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
	Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
	Pavel Emelyanov, linux-kernel

Cc: LKML

Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
| 
| Subject: [v12][PATCH 4/9] Add target_pids parameter to alloc_pid()
| 
| This parameter is currently NULL, but will be used in a follow-on patch.
| 
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Serge Hallyn <serue@us.ibm.com>
| Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
|  include/linux/pid.h |    2 +-
|  kernel/fork.c       |    3 ++-
|  kernel/pid.c        |    9 +++++++--
|  3 files changed, 10 insertions(+), 4 deletions(-)
| 
| diff --git a/include/linux/pid.h b/include/linux/pid.h
| index 49f1c2f..914185d 100644
| --- a/include/linux/pid.h
| +++ b/include/linux/pid.h
| @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
|  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
|  int next_pidmap(struct pid_namespace *pid_ns, int last);
| 
| -extern struct pid *alloc_pid(struct pid_namespace *ns);
| +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
|  extern void free_pid(struct pid *pid);
| 
|  /*
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 93626b2..3f1dddf 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -980,6 +980,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
|  	int retval;
|  	struct task_struct *p;
|  	int cgroup_callbacks_done = 0;
| +	pid_t *target_pids = NULL;
| 
|  	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
|  		return ERR_PTR(-EINVAL);
| @@ -1156,7 +1157,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
|  		goto bad_fork_cleanup_io;
| 
|  	if (pid != &init_struct_pid) {
| -		pid = alloc_pid(p->nsproxy->pid_ns);
| +		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
|  		if (IS_ERR(pid)) {
|  			retval = PTR_ERR(pid);
|  			goto bad_fork_cleanup_io;
| diff --git a/kernel/pid.c b/kernel/pid.c
| index afef766..3ee52cd 100644
| --- a/kernel/pid.c
| +++ b/kernel/pid.c
| @@ -281,13 +281,14 @@ void free_pid(struct pid *pid)
|  	call_rcu(&pid->rcu, delayed_put_pid);
|  }
| 
| -struct pid *alloc_pid(struct pid_namespace *ns)
| +struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
|  {
|  	struct pid *pid;
|  	enum pid_type type;
|  	int i, nr;
|  	struct pid_namespace *tmp;
|  	struct upid *upid;
| +	pid_t tpid;
| 
|  	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
|  	if (!pid) {
| @@ -297,7 +298,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
| 
|  	tmp = ns;
|  	for (i = ns->level; i >= 0; i--) {
| -		nr = alloc_pidmap(tmp);
| +		tpid = 0;
| +		if (target_pids)
| +			tpid = target_pids[i];
| +
| +		nr = set_pidmap(tmp, tpid);
|  		if (nr < 0)
|  			goto out_free;
| 
| -- 
| 1.6.0.4
| 
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [v12][PATCH 5/9] Add target_pids parameter to copy_process()
       [not found] ` <20091111044403.GE11393@suka>
@ 2009-11-11 22:40   ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
	Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
	Pavel Emelyanov, linux-kernel

CC: LKML

Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
| 
| Subject: [v12][PATCH 5/9] Add target_pids parameter to copy_process()
| 
| Add a 'target_pids' parameter to copy_process().  The new parameter will be
| used in a follow-on patch when eclone() is implemented.
| 
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Serge Hallyn <serue@us.ibm.com>
| Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
|  kernel/fork.c |    7 ++++---
|  1 files changed, 4 insertions(+), 3 deletions(-)
| 
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 3f1dddf..c8a06de 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -975,12 +975,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
|  					unsigned long stack_size,
|  					int __user *child_tidptr,
|  					struct pid *pid,
| +					pid_t *target_pids,
|  					int trace)
|  {
|  	int retval;
|  	struct task_struct *p;
|  	int cgroup_callbacks_done = 0;
| -	pid_t *target_pids = NULL;
| 
|  	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
|  		return ERR_PTR(-EINVAL);
| @@ -1361,7 +1361,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
|  	struct pt_regs regs;
| 
|  	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
| -			    &init_struct_pid, 0);
| +			    &init_struct_pid, NULL, 0);
|  	if (!IS_ERR(task))
|  		init_idle(task, cpu);
| 
| @@ -1384,6 +1384,7 @@ long do_fork(unsigned long clone_flags,
|  	struct task_struct *p;
|  	int trace = 0;
|  	long nr;
| +	pid_t *target_pids = NULL;
| 
|  	/*
|  	 * Do some preliminary argument and permissions checking before we
| @@ -1424,7 +1425,7 @@ long do_fork(unsigned long clone_flags,
|  		trace = tracehook_prepare_clone(clone_flags);
| 
|  	p = copy_process(clone_flags, stack_start, regs, stack_size,
| -			 child_tidptr, NULL, trace);
| +			 child_tidptr, NULL, target_pids, trace);
|  	/*
|  	 * Do this prior waking up the new thread - the thread pointer
|  	 * might get invalid after that point, if the thread exits quickly.
| -- 
| 1.6.0.4
| 
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [v12][PATCH 6/9] Check invalid clone flags
       [not found] ` <20091111044422.GF11393@suka>
@ 2009-11-11 22:40   ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
	Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
	Pavel Emelyanov, linux-kernel

Cc: LKML

Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
| 
| Subject: [v12][PATCH 6/9] Check invalid clone flags
| 
| As pointed out by Oren Laadan, we want to ensure that unused bits in the
| clone-flags remain unused and available for future. To ensure this, define
| a mask of clone-flags and check the flags in the clone() system calls.
| 
| Changelog[v9]:
| 	- Include the unused clone-flag (CLONE_UNUSED) to VALID_CLONE_FLAGS
| 	  to avoid breaking any applications that may have set it. IOW, this
| 	  patch/check only applies to clone-flags bits 33 and higher.
| 
| Changelog[v8]:
| 	- New patch in set
| 
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
|  include/linux/sched.h |   12 ++++++++++++
|  kernel/fork.c         |    3 +++
|  2 files changed, 15 insertions(+), 0 deletions(-)
| 
| diff --git a/include/linux/sched.h b/include/linux/sched.h
| index 75e6e60..a4d2c23 100644
| --- a/include/linux/sched.h
| +++ b/include/linux/sched.h
| @@ -29,6 +29,18 @@
|  #define CLONE_NEWNET		0x40000000	/* New network namespace */
|  #define CLONE_IO		0x80000000	/* Clone io context */
| 
| +#define CLONE_UNUSED        	0x00001000	/* Can be reused ? */
| +
| +#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
| +				 CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\
| +				 CLONE_VFORK  | CLONE_PARENT | CLONE_THREAD  |\
| +				 CLONE_NEWNS  | CLONE_SYSVSEM | CLONE_SETTLS |\
| +				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
| +				 CLONE_DETACHED | CLONE_UNTRACED             |\
| +				 CLONE_CHILD_SETTID | CLONE_STOPPED          |\
| +				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
| +				 CLONE_NEWPID | CLONE_NEWNET | CLONE_IO)
| +
|  /*
|   * Scheduling policies
|   */
| diff --git a/kernel/fork.c b/kernel/fork.c
| index c8a06de..11f77ed 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -982,6 +982,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
|  	struct task_struct *p;
|  	int cgroup_callbacks_done = 0;
| 
| +	if (clone_flags & ~VALID_CLONE_FLAGS)
| +		return ERR_PTR(-EINVAL);
| +
|  	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
|  		return ERR_PTR(-EINVAL);
| 
| -- 
| 1.6.0.4
| 
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [v12][PATCH 7/9] Define do_fork_with_pids()
       [not found] ` <20091111044438.GG11393@suka>
@ 2009-11-11 22:40   ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
	Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
	Pavel Emelyanov, linux-kernel

cc: lkml

Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
| 
| Subject: [v12][PATCH 7/9] Define do_fork_with_pids()
| 
| do_fork_with_pids() is same as do_fork(), except that it takes an
| additional, 'pid_set', parameter. This parameter, currently unused,
| specifies the set of target pids of the process in each of its pid
| namespaces.
| 
| Changelog[v7]:
| 	- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
| 	  instead of 'struct pid_set *'.
| 
| Changelog[v6]:
| 	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
| 	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
| 	  is constant across architectures.
| 	- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.
| 
| Changelog[v4]:
| 	- Rename 'struct target_pid_set' to 'struct pid_set' since it may
| 	  be useful in other contexts.
| 
| Changelog[v3]:
| 	- Fix "long-line" warning from checkpatch.pl
| 
| Changelog[v2]:
| 	- To facilitate moving architecture-inpdendent code to kernel/fork.c
| 	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
| 	  rather than 'pid_t *' (next patch moves the arch-independent
| 	  code to kernel/fork.c)
| 
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Serge Hallyn <serue@us.ibm.com>
| Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
|  include/linux/sched.h |    3 +++
|  kernel/fork.c         |   17 +++++++++++++++--
|  2 files changed, 18 insertions(+), 2 deletions(-)
| 
| diff --git a/include/linux/sched.h b/include/linux/sched.h
| index a4d2c23..85e971a 100644
| --- a/include/linux/sched.h
| +++ b/include/linux/sched.h
| @@ -2153,6 +2153,9 @@ extern int disallow_signal(int);
| 
|  extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
|  extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
| +extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
| +				unsigned long, int __user *, int __user *,
| +				unsigned int, pid_t __user *);
|  struct task_struct *fork_idle(int);
| 
|  extern void set_task_comm(struct task_struct *tsk, char *from);
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 11f77ed..210e841 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -1377,12 +1377,14 @@ struct task_struct * __cpuinit fork_idle(int cpu)
|   * It copies the process, and if successful kick-starts
|   * it and waits for it to finish using the VM if required.
|   */
| -long do_fork(unsigned long clone_flags,
| +long do_fork_with_pids(unsigned long clone_flags,
|  	      unsigned long stack_start,
|  	      struct pt_regs *regs,
|  	      unsigned long stack_size,
|  	      int __user *parent_tidptr,
| -	      int __user *child_tidptr)
| +	      int __user *child_tidptr,
| +	      unsigned int num_pids,
| +	      pid_t __user *upids)
|  {
|  	struct task_struct *p;
|  	int trace = 0;
| @@ -1485,6 +1487,17 @@ long do_fork(unsigned long clone_flags,
|  	return nr;
|  }
| 
| +long do_fork(unsigned long clone_flags,
| +	      unsigned long stack_start,
| +	      struct pt_regs *regs,
| +	      unsigned long stack_size,
| +	      int __user *parent_tidptr,
| +	      int __user *child_tidptr)
| +{
| +	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
| +			parent_tidptr, child_tidptr, 0, NULL);
| +}
| +
|  #ifndef ARCH_MIN_MMSTRUCT_ALIGN
|  #define ARCH_MIN_MMSTRUCT_ALIGN 0
|  #endif
| -- 
| 1.6.0.4
| 
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [v12][PATCH 8/9] Define eclone() syscall
       [not found] ` <20091111044509.GH11393@suka>
@ 2009-11-11 22:40   ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
	Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
	Pavel Emelyanov, linux-kernel

cc: lkml

Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
| 
| Subject: [v12][PATCH 8/9] Define eclone() syscall
| 
| Container restart requires that a task have the same pid it had when it was
| checkpointed. When containers are nested the tasks within the containers
| exist in multiple pid namespaces and hence have multiple pids to specify
| during restart.
| 
| eclone(), intended for use during restart, is the same as
| clone(), except that it takes a 'pids' paramter. This parameter lets
| caller choose specific pid numbers for the child process, in the
| process's active and ancestor pid namespaces. (Descendant pid namespaces
| in general don't matter since processes don't have pids in them anyway,
| but see comments in copy_target_pids() regarding CLONE_NEWPID).
| 
| eclone() also attempts to address a second limitation of the
| clone() system call. clone() is restricted to 32 clone flags and all but
| one of these are in use. If more new clone flags are needed, we will be
| forced to define a new variant of the clone() system call. To address
| this, eclone() allows at least 64 clone flags with some room
| for more if necessary.
| 
| To prevent unprivileged processes from misusing this interface,
| eclone() currently needs CAP_SYS_ADMIN, when the 'pids' parameter
| is non-NULL.
| 
| See Documentation/eclone in next patch for more details and an
| example of its usage.
| 
| NOTE:
| 	- System calls are restricted to 6 parameters and the number and sizes
| 	  of parameters needed for eclone() exceed 6 integers. The new
| 	  prototype works around this restriction while providing some
| 	  flexibility if eclone() needs to be further extended in the
| 	  future.
| TODO:
| 	- We should convert clone-flags to 64-bit value in all architectures.
| 	  Its probably best to do that as a separate patchset since clone_flags
| 	  touches several functions and that patchset seems independent of this
| 	  new system call.
| 
| Changelog[v12]:
| 	- [Serge Hallyn] Ignore ->child_stack_size if ->child_stack_base
| 	  is NULL.
| 	- [Oren Laadan, Serge Hallyn] Rename clone_with_pids() to eclone()
| Changelog[v11]:
| 	- [Dave Hansen] Move clone_args validation checks to arch-indpeendent
| 	  code.
| 	- [Oren Laadan] Make args_size a parameter to system call and remove
| 	  it from 'struct clone_args'
| 
| Changelog[v10]:
| 	- Rename clone3() to clone_with_pids()
| 	- [Linus Torvalds] Use PTREGSCALL() rather than the generic syscall
| 	  implementation
| 
| Changelog[v9]:
| 	- [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
| 	  architectures split the new clone-flags into 'low' and 'high'
| 	  words and pass in the 'lower' flags as the first argument.
| 	  This would maintain similarity of the clone3() with clone()/
| 	  clone2(). Also has the side-effect of the name matching the
| 	  number of parameters :-)
| 	- [Roland McGrath] Rename structure to 'clone_args' and add a
| 	  'child_stack_size' field
| 
| Changelog[v8]
| 	- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
| 	  must be 64-bit.
| 	- clone2() is in use in IA64. Rename system call to clone3().
| 
| Changelog[v7]:
| 	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
| 	  and group parameters into a new 'struct clone_struct' object.
| 
| Changelog[v6]:
| 	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
| 	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
| 	  is constant across architectures.
| 	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
| 	  'unum_pids < 0' check.
| 
| Changelog[v4]:
| 	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'
| 
| Changelog[v3]:
| 	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
| 	  in the target_pids[] list and setting it 0. See copy_target_pids()).
| 	- (Oren Laadan) Specified target pids should apply only to youngest
| 	  pid-namespaces (see copy_target_pids())
| 	- (Matt Helsley) Update patch description.
| 
| Changelog[v2]:
| 	- Remove unnecessary printk and add a note to callers of
| 	  copy_target_pids() to free target_pids.
| 	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
| 	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
| 	  'num_pids == 0' (fall back to normal clone()).
| 	- Move arch-independent code (sanity checks and copy-in of target-pids)
| 	  into kernel/fork.c and simplify sys_clone_with_pids()
| 
| Changelog[v1]:
| 	- Fixed some compile errors (had fixed these errors earlier in my
| 	  git tree but had not refreshed patches before emailing them)
| 
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
|  arch/x86/include/asm/syscalls.h    |    1 +
|  arch/x86/include/asm/unistd_32.h   |    1 +
|  arch/x86/kernel/entry_32.S         |    1 +
|  arch/x86/kernel/process_32.c       |   47 ++++++++++++++
|  arch/x86/kernel/syscall_table_32.S |    1 +
|  include/linux/sched.h              |    2 +
|  include/linux/types.h              |   11 +++
|  kernel/fork.c                      |  124 +++++++++++++++++++++++++++++++++++-
|  8 files changed, 187 insertions(+), 1 deletions(-)
| 
| diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
| index 372b76e..2cadb8e 100644
| --- a/arch/x86/include/asm/syscalls.h
| +++ b/arch/x86/include/asm/syscalls.h
| @@ -40,6 +40,7 @@ long sys_iopl(struct pt_regs *);
| 
|  /* kernel/process_32.c */
|  int sys_clone(struct pt_regs *);
| +int sys_eclone(struct pt_regs *);
|  int sys_execve(struct pt_regs *);
| 
|  /* kernel/signal.c */
| diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
| index 6fb3c20..030b121 100644
| --- a/arch/x86/include/asm/unistd_32.h
| +++ b/arch/x86/include/asm/unistd_32.h
| @@ -342,6 +342,7 @@
|  #define __NR_pwritev		334
|  #define __NR_rt_tgsigqueueinfo	335
|  #define __NR_perf_event_open	336
| +#define __NR_eclone		337
| 
|  #ifdef __KERNEL__
| 
| diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
| index c097e7d..7e7f3c8 100644
| --- a/arch/x86/kernel/entry_32.S
| +++ b/arch/x86/kernel/entry_32.S
| @@ -718,6 +718,7 @@ ptregs_##name: \
|  PTREGSCALL(iopl)
|  PTREGSCALL(fork)
|  PTREGSCALL(clone)
| +PTREGSCALL(eclone)
|  PTREGSCALL(vfork)
|  PTREGSCALL(execve)
|  PTREGSCALL(sigaltstack)
| diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
| index 4cf7956..be7c1a1 100644
| --- a/arch/x86/kernel/process_32.c
| +++ b/arch/x86/kernel/process_32.c
| @@ -445,6 +445,53 @@ int sys_clone(struct pt_regs *regs)
|  	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
|  }
| 
| +int sys_eclone(struct pt_regs *regs)
| +{
| +	int rc;
| +	int args_size;
| +	struct clone_args kca;
| +	unsigned long flags;
| +	int __user *parent_tid_ptr;
| +	int __user *child_tid_ptr;
| +	unsigned long __user child_stack;
| +	unsigned long stack_size;
| +	unsigned int flags_low;
| +	struct clone_args __user *uca;
| +	pid_t __user *pids;
| +
| +	flags_low = regs->bx;
| +	uca = (struct clone_args __user *)regs->cx;
| +	args_size = regs->dx;
| +	pids = (pid_t __user *)regs->di;
| +
| +	rc = fetch_clone_args_from_user(uca, args_size, &kca);
| +	if (rc)
| +		return rc;
| +
| +	/*
| +	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
| +	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
| +	 * 	 higher word(s) of 'flags':
| +	 *
| +	 * 		flags = (kca.clone_flags_high << 32) | flags_low;
| +	 */
| +	flags = flags_low;
| +	parent_tid_ptr = (int *)kca.parent_tid_ptr;
| +	child_tid_ptr = (int *)kca.child_tid_ptr;
| +	stack_size = (unsigned long)kca.child_stack_size;
| +
| +	child_stack = 0UL;
| +	if (kca.child_stack_base)
| +		child_stack = (unsigned long)kca.child_stack_base + stack_size;
| +
| +	if (!child_stack)
| +		child_stack = regs->sp;
| +
| +	return do_fork_with_pids(flags, child_stack, regs, stack_size,
| +				parent_tid_ptr, child_tid_ptr, kca.nr_pids,
| +				pids);
| +}
| +
|  /*
|   * sys_execve() executes a new program.
|   */
| diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
| index 0157cd2..2b6394c 100644
| --- a/arch/x86/kernel/syscall_table_32.S
| +++ b/arch/x86/kernel/syscall_table_32.S
| @@ -336,3 +336,4 @@ ENTRY(sys_call_table)
|  	.long sys_pwritev
|  	.long sys_rt_tgsigqueueinfo	/* 335 */
|  	.long sys_perf_event_open
| +	.long ptregs_eclone
| diff --git a/include/linux/sched.h b/include/linux/sched.h
| index 85e971a..b2a881d 100644
| --- a/include/linux/sched.h
| +++ b/include/linux/sched.h
| @@ -2153,6 +2153,8 @@ extern int disallow_signal(int);
| 
|  extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
|  extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
| +extern int fetch_clone_args_from_user(struct clone_args __user *, int,
| +				struct clone_args *);
|  extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
|  				unsigned long, int __user *, int __user *,
|  				unsigned int, pid_t __user *);
| diff --git a/include/linux/types.h b/include/linux/types.h
| index c42724f..0d08683 100644
| --- a/include/linux/types.h
| +++ b/include/linux/types.h
| @@ -204,6 +204,17 @@ struct ustat {
|  	char			f_fpack[6];
|  };
| 
| +struct clone_args {
| +	u64 clone_flags_high;
| +	u64 child_stack_base;
| +	u64 child_stack_size;
| +	u64 parent_tid_ptr;
| +	u64 child_tid_ptr;
| +	u32 nr_pids;
| +	u32 reserved0;
| +	u64 reserved1;
| +};
| +
|  #endif	/* __KERNEL__ */
|  #endif /*  __ASSEMBLY__ */
|  #endif /* _LINUX_TYPES_H */
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 210e841..cacc26b 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -1372,6 +1372,114 @@ struct task_struct * __cpuinit fork_idle(int cpu)
|  }
| 
|  /*
| + * If user specified any 'target-pids' in @upid_setp, copy them from
| + * user and return a pointer to a local copy of the list of pids. The
| + * caller must free the list, when they are done using it.
| + *
| + * If user did not specify any target pids, return NULL (caller should
| + * treat this like normal clone).
| + *
| + * On any errors, return the error code
| + */
| +static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
| +{
| +	int j;
| +	int rc;
| +	int size;
| +	int knum_pids;		/* # of pids needed in kernel */
| +	pid_t *target_pids;
| +
| +	if (!unum_pids)
| +		return NULL;
| +
| +	knum_pids = task_pid(current)->level + 1;
| +	if (unum_pids > knum_pids)
| +		return ERR_PTR(-EINVAL);
| +
| +	/*
| +	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
| +	 * and set it to 0. This last entry in target_pids[] corresponds to the
| +	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
| +	 * specified. If CLONE_NEWPID was not specified, this last entry will
| +	 * simply be ignored.
| +	 */
| +	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
| +	if (!target_pids)
| +		return ERR_PTR(-ENOMEM);
| +
| +	/*
| +	 * A process running in a level 2 pid namespace has three pid namespaces
| +	 * and hence three pid numbers. If this process is checkpointed,
| +	 * information about these three namespaces are saved. We refer to these
| +	 * namespaces as 'known namespaces'.
| +	 *
| +	 * If this checkpointed process is however restarted in a level 3 pid
| +	 * namespace, the restarted process has an extra ancestor pid namespace
| +	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
| +	 *
| +	 * During restart, the process requests specific pids for its 'known
| +	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
| +	 *
| +	 * Since the requested-pids correspond to 'known namespaces' and since
| +	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
| +	 * namespaces', copy requested pids to the back-end of target_pids[]
| +	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
| +	 * Any entries in target_pids[] not corresponding to a requested pid
| +	 * will be set to zero and kernel assigns a pid in those namespaces.
| +	 *
| +	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
| +	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
| +	 * 	 the order is:
| +	 *
| +	 * 		- pids for 'unknown-namespaces' (if any)
| +	 * 		- pids for 'known-namespaces' (requested pids)
| +	 * 		- 0 in the last entry (for CLONE_NEWPID).
| +	 */
| +	j = knum_pids - unum_pids;
| +	size = unum_pids * sizeof(pid_t);
| +
| +	rc = copy_from_user(&target_pids[j], upids, size);
| +	if (rc) {
| +		rc = -EFAULT;
| +		goto out_free;
| +	}
| +
| +	return target_pids;
| +
| +out_free:
| +	kfree(target_pids);
| +	return ERR_PTR(rc);
| +}
| +
| +int
| +fetch_clone_args_from_user(struct clone_args __user *uca, int args_size,
| +			struct clone_args *kca)
| +{
| +	int rc;
| +
| +	/*
| +	 * TODO: If size of clone_args is not what the kernel expects, it
| +	 * 	 could be that kernel is newer and has an extended structure.
| +	 * 	 When that happens, this check needs to be smarter.  For now,
| +	 * 	 assume exact match.
| +	 */
| +	if (args_size != sizeof(struct clone_args))
| +		return -EINVAL;
| +
| +	rc = copy_from_user(kca, uca, args_size);
| +	if (rc)
| +		return -EFAULT;
| +
| +	/*
| +	 * To avoid future compatibility issues, ensure unused fields are 0.
| +	 */
| +	if (kca->reserved0 || kca->reserved1 || kca->clone_flags_high)
| +		return -EINVAL;
| +
| +	return 0;
| +}
| +
| +/*
|   *  Ok, this is the main fork-routine.
|   *
|   * It copies the process, and if successful kick-starts
| @@ -1389,7 +1497,7 @@ long do_fork_with_pids(unsigned long clone_flags,
|  	struct task_struct *p;
|  	int trace = 0;
|  	long nr;
| -	pid_t *target_pids = NULL;
| +	pid_t *target_pids;
| 
|  	/*
|  	 * Do some preliminary argument and permissions checking before we
| @@ -1423,6 +1531,16 @@ long do_fork_with_pids(unsigned long clone_flags,
|  		}
|  	}
| 
| +	target_pids = copy_target_pids(num_pids, upids);
| +	if (target_pids) {
| +		if (IS_ERR(target_pids))
| +			return PTR_ERR(target_pids);
| +
| +		nr = -EPERM;
| +		if (!capable(CAP_SYS_ADMIN))
| +			goto out_free;
| +	}
| +
|  	/*
|  	 * When called from kernel_thread, don't do user tracing stuff.
|  	 */
| @@ -1484,6 +1602,10 @@ long do_fork_with_pids(unsigned long clone_flags,
|  	} else {
|  		nr = PTR_ERR(p);
|  	}
| +
| +out_free:
| +	kfree(target_pids);
| +
|  	return nr;
|  }
| 
| -- 
| 1.6.0.4
| 
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [v12][PATCH 9/9] Document eclone() syscall
       [not found] ` <20091111044527.GI11393@suka>
@ 2009-11-11 22:41   ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
	Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
	Pavel Emelyanov, linux-kernel

CC: LKML

Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
| 
| Subject: [v12][PATCH 9/9] Document eclone() syscall
| 
| This gives a brief overview of the eclone() system call.  We should
| eventually describe more details in existing clone(2) man page or in
| a new man page.
| 
| Changelog[v12]:
| 	- [Serge Hallyn] Fix/simplify stack-setup in the example code
| 	- [Serge Hallyn, Oren Laadan] Rename syscall to eclone()
| 
| Changelog[v11]:
| 	- [Dave Hansen] Move clone_args validation checks to arch-indpendent
| 	  code.
| 	- [Oren Laadan] Make args_size a parameter to system call and remove
| 	  it from 'struct clone_args'
| 	- [Oren Laadan] Fix some typos and clarify the order of pids in the
| 	  @pids parameter.
| 
| Changelog[v10]:
| 	- Rename clone3() to clone_with_pids() and fix some typos.
| 	- Modify example to show usage with the ptregs implementation.
| Changelog[v9]:
| 	- [Pavel Machek]: Fix an inconsistency and rename new file to
| 	  Documentation/clone3.
| 	- [Roland McGrath, H. Peter Anvin] Updates to description and
| 	  example to reflect new prototype of clone3() and the updated/
| 	  renamed 'struct clone_args'.
| 
| Changelog[v8]:
| 	- clone2() is already in use in IA64. Rename syscall to clone3()
| 	- Add notes to say that we return -EINVAL if invalid clone flags
| 	  are specified or if the reserved fields are not 0.
| Changelog[v7]:
| 	- Rename clone_with_pids() to clone2()
| 	- Changes to reflect new prototype of clone2() (using clone_struct).
| 
| Signed-off-by: Sukadev Bhattiprolu <sukadev@vnet.linux.ibm.com>
| Acked-by: Oren Laadan  <orenl@cs.columbia.edu>
| ---
|  Documentation/eclone |  345 ++++++++++++++++++++++++++++++++++++++++++++++++++
|  1 files changed, 345 insertions(+), 0 deletions(-)
|  create mode 100644 Documentation/eclone
| 
| diff --git a/Documentation/eclone b/Documentation/eclone
| new file mode 100644
| index 0000000..9b4ec84
| --- /dev/null
| +++ b/Documentation/eclone
| @@ -0,0 +1,345 @@
| +
| +struct clone_args {
| +	u64 clone_flags_high;
| +	u64 child_stack_base;
| +	u64 child_stack_size;
| +	u64 parent_tid_ptr;
| +	u64 child_tid_ptr;
| +	u32 nr_pids;
| +	u32 reserved0;
| +	u64 reserved1;
| +};
| +
| +
| +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
| +		pid_t * __user pids)
| +
| +	In addition to doing everything that clone() system call does, the
| +	eclone() system call:
| +
| +		- allows additional clone flags (31 of 32 bits in the flags
| +		  parameter to clone() are in use)
| +
| +		- allows user to specify a pid for the child process in its
| +		  active and ancestor pid namespaces.
| +
| +	This system call is meant to be used when restarting an application
| +	from a checkpoint. Such restart requires that the processes in the
| +	application have the same pids they had when the application was
| +	checkpointed. When containers are nested, the processes within the
| +	containers exist in multiple pid namespaces and hence have multiple
| +	pids to specify during restart.
| +
| +	The @flags_low parameter is identical to the 'clone_flags' parameter
| +	in existing clone() system call.
| +
| +	The fields in 'struct clone_args' are meant to be used as follows:
| +
| +	u64 clone_flags_high:
| +
| +		When eclone() supports more than 32 flags, the additional bits
| +		in the clone_flags should be specified in this field. This
| +		field is currently unused and must be set to 0.
| +
| +	u64 child_stack_base;
| +	u64 child_stack_size;
| +
| +		These two fields correspond to the 'child_stack' fields in
| +		clone() and clone2() (on IA64) system calls.
| +
| +	u64 parent_tid_ptr;
| +	u64 child_tid_ptr;
| +
| +		These two fields correspond to the 'parent_tid_ptr' and
| +		'child_tid_ptr' fields in the clone() system call
| +
| +	u32 nr_pids;
| +
| +		nr_pids specifies the number of pids in the @pids array
| +		parameter to eclone() (see below). nr_pids should not exceed
| +		the current nesting level of the calling process (i.e if the
| +		process is in init_pid_ns, nr_pids must be 1, if process is
| +		in a pid namespace that is a child of init-pid-ns, nr_pids
| +		cannot exceed 2, and so on).
| +
| +	u32 reserved0;
| +	u64 reserved1;
| +
| +		These fields are intended to extend the functionality of the
| +		eclone() in the future, while preserving backward compatibility.
| +		They must be set to 0 for now.
| +
| +	The @cargs_size parameter specifes the sizeof(struct clone_args) and
| +	is intended to enable extending this structure in the future, while
| +	preserving backward compatibility.  For now, this field must be set
| +	to the sizeof(struct clone_args) and this size must match the kernel's
| +	view of the structure.
| +
| +	The @pids parameter defines the set of pids that should be assigned to
| +	the child process in its active and ancestor pid namespaces. The
| +	descendant pid namespaces do not matter since a process does not have a
| +	pid in descendant namespaces, unless the process is in a new pid
| +	namespace in which case the process is a container-init (and must have
| +	the pid 1 in that namespace).
| +
| +	See CLONE_NEWPID section of clone(2) man page for details about pid
| +	namespaces.
| +
| +	If a pid in the @pids list is 0, the kernel will assign the next
| +	available pid in the pid namespace.
| +
| +	If a pid in the @pids list is non-zero, the kernel tries to assign
| +	the specified pid in that namespace.  If that pid is already in use
| +	by another process, the system call fails (see EBUSY below).
| +
| +	The order of pids in @pids is oldest in pids[0] to youngest pid
| +	namespace in pids[nr_pids-1]. If the number of pids specified in the
| +	@pids list is fewer than the nesting level of the process, the pids
| +	are applied from youngest namespace. i.e if the process is nested in
| +	a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
| +	are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
| +	have a pid of '0' (the kernel will assign a pid in those namespaces).
| +
| +	On success, the system call returns the pid of the child process in
| +	the parent's active pid namespace.
| +
| +	On failure, eclone() returns -1 and sets 'errno' to one of following
| +	values (the child process is not created).
| +
| +	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
| +		specify the pids in this call (if pids are not specifed
| +		CAP_SYS_ADMIN is not required).
| +
| +	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
| +		the current nesting level of parent process
| +
| +	EINVAL	Not all specified clone-flags are valid.
| +
| +	EINVAL	The reserved fields in the clone_args argument are not 0.
| +
| +	EBUSY	A requested pid is in use by another process in that namespace.
| +
| +---
| +/*
| + * Example eclone() usage - Create a child process with pid CHILD_TID1 in
| + * the current pid namespace. The child gets the usual "random" pid in any
| + * ancestor pid namespaces.
| + */
| +#include <stdio.h>
| +#include <stdlib.h>
| +#include <string.h>
| +#include <signal.h>
| +#include <errno.h>
| +#include <unistd.h>
| +#include <wait.h>
| +#include <sys/syscall.h>
| +
| +#define __NR_eclone		337
| +#define CLONE_NEWPID            0x20000000
| +#define CLONE_CHILD_SETTID      0x01000000
| +#define CLONE_PARENT_SETTID     0x00100000
| +#define CLONE_UNUSED		0x00001000
| +
| +#define STACKSIZE		8192
| +
| +typedef unsigned long long u64;
| +typedef unsigned int u32;
| +typedef int pid_t;
| +struct clone_args {
| +	u64 clone_flags_high;
| +
| +	u64 child_stack_base;
| +	u64 child_stack_size;
| +
| +	u64 parent_tid_ptr;
| +	u64 child_tid_ptr;
| +
| +	u32 nr_pids;
| +
| +	u32 reserved0;
| +	u64 reserved1;
| +};
| +
| +#define exit		_exit
| +
| +/*
| + * Following eclone() is based on code posted by Oren Laadan at:
| + * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
| + */
| +#if defined(__i386__) && defined(__NR_eclone)
| +
| +int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
| +		int *pids)
| +{
| +	long retval;
| +
| +	__asm__  __volatile__(
| +		 "movl %0, %%ebx\n\t"		/* flags -> 1st (ebx) */
| +		 "movl %1, %%ecx\n\t"		/* clone_args -> 2nd (ecx)*/
| +		 "movl %2, %%edx\n\t"		/* args_size -> 3rd (edx) */
| +		 "movl %3, %%edi\n\t"		/* pids -> 4th (edi)*/
| +		 "pushl %%ebp\n\t"		/* save value of ebp */
| +		:
| +		:"b" (flags_low),
| +		 "c" (clone_args),
| +		 "d" (args_size),
| +		 "D" (pids)
| +		);
| +
| +	__asm__ __volatile__(
| +		 "int $0x80\n\t"	/* Linux/i386 system call */
| +		 "testl %0,%0\n\t"	/* check return value */
| +		 "jne 1f\n\t"		/* jump if parent */
| +		 "popl %%ebx\n\t"	/* get subthread function */
| +		 "call *%%ebx\n\t"	/* start subthread function */
| +		 "movl %2,%0\n\t"
| +		 "int $0x80\n"		/* exit system call: exit subthread */
| +		 "1:\n\t"
| +		 "popl %%ebp\t"		/* restore parent's ebp */
| +		:"=a" (retval)
| +		:"0" (__NR_eclone), "i" (__NR_exit)
| +		:"ebx", "ecx", "edx"
| +		);
| +
| +	if (retval < 0) {
| +		errno = -retval;
| +		retval = -1;
| +	}
| +	return retval;
| +}
| +
| +/*
| + * Allocate a stack for the clone-child and arrange to have the child
| + * execute @child_fn with @child_arg as the argument.
| + */
| +void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
| +{
| +	void *stack_base;
| +	void **stack_top;
| +	int frame_size;
| +	
| +	/*
| +	 * Allocate extra space to push @child_arg and @child_fn near
| +	 * top of stack
| +	 */
| +	frame_size= 12;
| +
| +	stack_base = malloc(size + frame_size);
| +	if (!stack_base) {
| +		perror("malloc()");
| +		exit(1);
| +	}
| +
| +	stack_top = (void **)((char *)stack_base + (size + frame_size - 4));
| +	*--stack_top = child_arg;
| +	*--stack_top = child_fn;
| +
| +	return stack_base;
| +}
| +#endif
| +
| +/* gettid() is a bit more useful than getpid() when messing with clone() */
| +int gettid()
| +{
| +	int rc;
| +
| +	rc = syscall(__NR_gettid, 0, 0, 0);
| +	if (rc < 0) {
| +		printf("rc %d, errno %d\n", rc, errno);
| +		exit(1);
| +	}
| +	return rc;
| +}
| +
| +#define CHILD_TID1	377
| +#define CHILD_TID2	1177
| +#define CHILD_TID3	2799
| +
| +struct clone_args clone_args;
| +void *child_arg = &clone_args;
| +int child_tid;
| +
| +int do_child(void *arg)
| +{
| +	struct clone_args *cs = (struct clone_args *)arg;
| +	int ctid;
| +
| +	/* Verify we pushed the arguments correctly on the stack... */
| +	if (arg != child_arg)  {
| +		printf("Child: Incorrect child arg pointer, expected %p,"
| +				"actual %p\n", child_arg, arg);
| +		exit(1);
| +	}
| +
| +	/* ... and that we got the thread-id we expected */
| +	ctid = *((int *)cs->child_tid_ptr);
| +	if (ctid != CHILD_TID1) {
| +		printf("Child: Incorrect child tid, expected %d, actual %d\n",
| +				CHILD_TID1, ctid);
| +		exit(1);
| +	} else {
| +		printf("Child got the expected tid, %d\n", gettid());
| +	}
| +	sleep(3);
| +
| +	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
| +	exit(0);
| +}
| +
| +static int do_clone(int (*child_fn)(void *), void *child_arg,
| +		unsigned int flags_low, int nr_pids, pid_t *pids_list)
| +{
| +	int rc;
| +	void *stack;
| +	struct clone_args *ca = &clone_args;
| +	int args_size;
| +
| +	stack = setup_stack(child_fn, child_arg, STACKSIZE);
| +
| +	memset(ca, 0, sizeof(*ca));
| +
| +	ca->child_stack_base	= (u64)stack;
| +	ca->child_stack_size	= (u64)STACKSIZE;
| +	ca->child_tid_ptr	= (u64)&child_tid;
| +	ca->nr_pids		= nr_pids;
| +
| +	args_size = sizeof(struct clone_args);
| +	rc = eclone(flags_low, ca, args_size, pids_list);
| +
| +	printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
| +				rc, errno);
| +	return rc;
| +}
| +
| +/*
| + * Multiple pid_t pid_t values in pids_list[] here are just for illustration.
| + * The test case creates a child in the current pid namespace and uses only
| + * the first value, CHILD_TID1 (note that nr_pids == 1).
| + */
| +pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
| +main()
| +{
| +	int rc, pid, ret, status;
| +	unsigned long flags;
| +	int nr_pids = 1;
| +
| +	flags = SIGCHLD|CLONE_CHILD_SETTID;
| +
| +	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
| +
| +	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
| +
| +	rc = waitpid(pid, &status, __WALL);
| +	if (rc < 0) {
| +		printf("waitpid(): rc %d, error %d\n", rc, errno);
| +	} else {
| +		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
| +			 gettid(), rc, status);
| +
| +		if (WIFEXITED(status)) {
| +			printf("\t EXITED, %d\n", WEXITSTATUS(status));
| +		} else if (WIFSIGNALED(status)) {
| +			printf("\t SIGNALED, %d\n", WTERMSIG(status));
| +		}
| +	}
| +}
| -- 
| 1.6.0.4
| 
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2009-11-11 22:40 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20091111043440.GA9377@suka>
2009-11-11 22:38 ` [v12][PATCH 0/9] Implement eclone() syscall Sukadev Bhattiprolu
     [not found] ` <20091111044250.GA11393@suka>
2009-11-11 22:38   ` [v12][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
     [not found] ` <20091111044313.GB11393@suka>
2009-11-11 22:39   ` [v12][PATCH 2/9] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
     [not found] ` <20091111044329.GC11393@suka>
2009-11-11 22:39   ` [v12][PATCH 3/9] Define set_pidmap() function Sukadev Bhattiprolu
     [not found] ` <20091111044347.GD11393@suka>
2009-11-11 22:39   ` [v12][PATCH 4/9] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
     [not found] ` <20091111044403.GE11393@suka>
2009-11-11 22:40   ` [v12][PATCH 5/9] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
     [not found] ` <20091111044422.GF11393@suka>
2009-11-11 22:40   ` [v12][PATCH 6/9] Check invalid clone flags Sukadev Bhattiprolu
     [not found] ` <20091111044438.GG11393@suka>
2009-11-11 22:40   ` [v12][PATCH 7/9] Define do_fork_with_pids() Sukadev Bhattiprolu
     [not found] ` <20091111044509.GH11393@suka>
2009-11-11 22:40   ` [v12][PATCH 8/9] Define eclone() syscall Sukadev Bhattiprolu
     [not found] ` <20091111044527.GI11393@suka>
2009-11-11 22:41   ` [v12][PATCH 9/9] Document " Sukadev Bhattiprolu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox