Linux Container Development
 help / color / mirror / Atom feed
From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
To: Sukadev Bhattiprolu
	<sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Cc: Containers
	<containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>,
	"David C. Hansen"
	<haveblue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: Re: [RFC][v3][PATCH 7/7] Define clone_with_pids syscall
Date: Sun, 31 May 2009 13:59:51 -0400	[thread overview]
Message-ID: <4A22C597.8060207@cs.columbia.edu> (raw)
In-Reply-To: <20090531000350.GF4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>


the code looks good. A couple of nits about comments, below.

Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>

Sukadev Bhattiprolu wrote:
> Serge,
> 
> I have removed restriction on CLONE_NEWPID and may need a new ack.
> 
> Sukadev
> ---
> From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> Date: Mon, 4 May 2009 01:17:45 -0700
> Subject: [RFC][v3][PATCH 7/7] Define clone_with_pids syscall
> 
> Container restart requires that a task have the same pid it had when it was
> checkpointed. When containers are nested the tasks within the containers
> exist in multiple pid namespaces and hence have multiple pids to specify
> during restart.
> 
> clone_with_pids(), intended for use during restart, is the same as clone(),
> except that it takes a 'target_pid_set' paramter. This parameter lets caller
> choose specific pid numbers for the child process, in the process's active
> and ancestor pid namespaces. (Descendant pid namespaces in general don't
> matter since processes don't have pids in them anyway, but see comments
> in copy_target_pids() regarding CLONE_NEWPID).
> 
> Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to
> prevent unprivileged processes from misusing this interface.
> 
> Call clone_with_pids as follows:
> 
> 	pid_t pids[] = { 0, 77, 99 };
> 	struct target_pid_set pid_set;
> 
> 	pid_set.num_pids = sizeof(pids) / sizeof(int);
> 	pid_set.target_pids = &pids;
> 
> 	syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);
> 
> If a target-pid is 0, the kernel continues to assign a pid for the process in
> that namespace. In the above example, pids[0] is 0, meaning the kernel will
> assign next available pid to the process in init_pid_ns. But kernel will assign
> pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either
> 77 or 99 are taken, the system call fails with -EBUSY.
> 
> If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
> the system call fails with -EINVAL.

Perhaps also add:
" If 'pid_set.num_pids' falls short of the current nesting level of
  pid namespaces, the syscall call assumes extra 0's at the head of
  the array."

> 
> Its mostly an exploratory patch seeking feedback on the interface.
> 
> NOTE:
> 	Compared to clone(), clone_with_pids() needs to pass in two more
> 	pieces of information:
> 
> 		- number of pids in the set
> 		- user buffer containing the list of pids.
> 
> 	But since clone() already takes 5 parameters, use a 'struct
> 	target_pid_set'.
> 
> TODO:
> 	- Gently tested.
> 	- May need additional sanity checks in do_fork_with_pids().
> 
> Changelog[v3]:
> 	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
> 	  in the target_pids[] list and setting it 0. See copy_target_pids()).
> 	- (Oren Laadan) Specified target pids should apply only to youngest
> 	  pid-namespaces (see copy_target_pids())
> 	- (Matt Helsley) Update patch description.
> 
> Changelog[v2]:
> 	- Remove unnecessary printk and add a note to callers of
> 	  copy_target_pids() to free target_pids.
> 	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
> 	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
> 	  'num_pids == 0' (fall back to normal clone()).
> 	- Move arch-independent code (sanity checks and copy-in of target-pids)
> 	  into kernel/fork.c and simplify sys_clone_with_pids()
> 
> Changelog[v1]:
> 	- Fixed some compile errors (had fixed these errors earlier in my
> 	  git tree but had not refreshed patches before emailing them)
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> ---
>  arch/x86/include/asm/syscalls.h    |    1 +
>  arch/x86/include/asm/unistd_32.h   |    1 +
>  arch/x86/kernel/entry_32.S         |    1 +
>  arch/x86/kernel/process_32.c       |   21 +++++++
>  arch/x86/kernel/syscall_table_32.S |    1 +
>  kernel/fork.c                      |  108 +++++++++++++++++++++++++++++++++++-
>  6 files changed, 132 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
> index 7043408..1fdc149 100644
> --- a/arch/x86/include/asm/syscalls.h
> +++ b/arch/x86/include/asm/syscalls.h
> @@ -31,6 +31,7 @@ asmlinkage int sys_get_thread_area(struct user_desc __user *);
>  /* kernel/process_32.c */
>  int sys_fork(struct pt_regs *);
>  int sys_clone(struct pt_regs *);
> +int sys_clone_with_pids(struct pt_regs *);
>  int sys_vfork(struct pt_regs *);
>  int sys_execve(struct pt_regs *);
>  
> diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
> index 6e72d74..90f906f 100644
> --- a/arch/x86/include/asm/unistd_32.h
> +++ b/arch/x86/include/asm/unistd_32.h
> @@ -340,6 +340,7 @@
>  #define __NR_inotify_init1	332
>  #define __NR_preadv		333
>  #define __NR_pwritev		334
> +#define __NR_clone_with_pids	335
>  
>  #ifdef __KERNEL__
>  
> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index c929add..ee92b0d 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -707,6 +707,7 @@ ptregs_##name: \
>  PTREGSCALL(iopl)
>  PTREGSCALL(fork)
>  PTREGSCALL(clone)
> +PTREGSCALL(clone_with_pids)
>  PTREGSCALL(vfork)
>  PTREGSCALL(execve)
>  PTREGSCALL(sigaltstack)
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 76f8f84..1efc3de 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -445,6 +445,27 @@ int sys_clone(struct pt_regs *regs)
>  	return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
>  }
>  
> +int sys_clone_with_pids(struct pt_regs *regs)
> +{
> +	unsigned long clone_flags;
> +	unsigned long newsp;
> +	int __user *parent_tidptr;
> +	int __user *child_tidptr;
> +	void __user *upid_setp;
> +
> +	clone_flags = regs->bx;
> +	newsp = regs->cx;
> +	parent_tidptr = (int __user *)regs->dx;
> +	child_tidptr = (int __user *)regs->di;
> +	upid_setp = (void __user *)regs->bp;
> +
> +	if (!newsp)
> +		newsp = regs->sp;
> +
> +	return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr,
> +			child_tidptr, upid_setp);
> +}
> +
>  /*
>   * sys_execve() executes a new program.
>   */
> diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
> index ff5c873..94c1a58 100644
> --- a/arch/x86/kernel/syscall_table_32.S
> +++ b/arch/x86/kernel/syscall_table_32.S
> @@ -334,3 +334,4 @@ ENTRY(sys_call_table)
>  	.long sys_inotify_init1
>  	.long sys_preadv
>  	.long sys_pwritev
> +	.long ptregs_clone_with_pids	/* 335 */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a16ef7b..7738aed 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1335,6 +1335,97 @@ struct task_struct * __cpuinit fork_idle(int cpu)
>  }
>  
>  /*
> + * If user specified any 'target-pids' in @upid_setp, copy them from
> + * user and return a pointer to a local copy of the list of pids. The
> + * caller must free the list, when they are done using it.
> + *
> + * If user did not specify any target pids, return NULL (caller should
> + * treat this like normal clone).
> + *
> + * On any errors, return the error code
> + */
> +static pid_t *copy_target_pids(void __user *upid_setp)
> +{
> +	int j;
> +	int rc;
> +	int size;
> +	int unum_pids;		/* # of pids specified by user */
> +	int knum_pids;		/* # of pids needed in kernel */
> +	pid_t *target_pids;
> +	struct target_pid_set pid_set;
> +
> +	if (!upid_setp)
> +		return NULL;
> +
> +	rc = copy_from_user(&pid_set, upid_setp, sizeof(pid_set));
> +	if (rc)
> +		return ERR_PTR(-EFAULT);
> +
> +	unum_pids = pid_set.num_pids;
> +	knum_pids = task_pid(current)->level + 1;
> +
> +	if (!unum_pids)
> +		return NULL;
> +
> +	if (unum_pids < 0 || unum_pids > knum_pids)
> +		return ERR_PTR(-EINVAL);
> +
> +	/*
> +	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
> +	 * and set it to 0. This last entry in target_pids[] corresponds to the
> +	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
> +	 * specified. If CLONE_NEWPID was not specified, this last entry will
> +	 * simply be ignored.
> +	 */
> +	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
> +	if (!target_pids)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/*
> +	 * A process running in a level 2 pid namespace has three pid namespaces
> +	 * and hence three pid numbers. If this process is checkpointed,
> +	 * information about these three namespaces are saved. We refer to these
> +	 * namespaces as 'known namespaces'.

The number of levels saved at checkpoint depends not only on the
checkpointee, but also on the checkpointer's nesting level.

Perhaps:
  [...and hence three pid numbers.] The process can be checkpointed
  from (its) level 0, 1 or 2. If this process is checkpointed from
  level 1, information about two namespaces is saved. [We refer ...]

> +	 *
> +	 * If this checkpointed process is however restarted in a level 3 pid
> +	 * namespace, the restarted process has an extra ancestor pid namespace
> +	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
> +	 *
> +	 * During restart, the process requests specific pids for its 'known
> +	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
> +	 *
> +	 * Since the requested-pids correspond to 'known namespaces' and since
> +	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
> +	 * namespaces', copy requested pids to the back-end of target_pids[]
> +	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
> +	 * Any entries in target_pids[] not corresponding to a requested pid
> +	 * will be set to zero and kernel assigns a pid in those namespaces.
> +	 *
> +	 * NOTE: The order of pids in target_pids[] is oldest pid namespace to
> +	 * 	 youngest (target_pids[0] corresponds to init_pid_ns). i.e.
> +	 * 	 the order is:
> +	 *
> +	 * 		- pids for 'unknown-namespaces' (if any)
> +	 * 		- pids for 'known-namespaces' (requested pids)
> +	 * 		- 0 in the last entry (for CLONE_NEWPID).
> +	 */
> +	j = knum_pids - unum_pids;
> +	size = unum_pids * sizeof(pid_t);
> +
> +	rc = copy_from_user(&target_pids[j], pid_set.target_pids, size);
> +	if (rc) {
> +		rc = -EFAULT;
> +		goto out_free;
> +	}
> +
> +	return target_pids;
> +
> +out_free:
> +	kfree(target_pids);
> +	return ERR_PTR(rc);
> +}
> +
> +/*
>   *  Ok, this is the main fork-routine.
>   *
>   * It copies the process, and if successful kick-starts
> @@ -1351,7 +1442,7 @@ long do_fork_with_pids(unsigned long clone_flags,
>  	struct task_struct *p;
>  	int trace = 0;
>  	long nr;
> -	pid_t *target_pids = NULL;
> +	pid_t *target_pids;
>  
>  	/*
>  	 * Do some preliminary argument and permissions checking before we
> @@ -1385,6 +1476,17 @@ long do_fork_with_pids(unsigned long clone_flags,
>  		}
>  	}
>  
> +	target_pids = copy_target_pids(pid_setp);
> +
> +	if (target_pids) {
> +		if (IS_ERR(target_pids))
> +			return PTR_ERR(target_pids);
> +
> +		nr = -EPERM;
> +		if (!capable(CAP_SYS_ADMIN))
> +			goto out_free;
> +	}
> +
>  	/*
>  	 * When called from kernel_thread, don't do user tracing stuff.
>  	 */
> @@ -1446,6 +1548,10 @@ long do_fork_with_pids(unsigned long clone_flags,
>  	} else {
>  		nr = PTR_ERR(p);
>  	}
> +
> +out_free:
> +	kfree(target_pids);
> +
>  	return nr;
>  }
>  

  parent reply	other threads:[~2009-05-31 17:59 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-30 23:57 [RFC][v3][PATCH 1/7] Factor out code to allocate pidmap page Sukadev Bhattiprolu
     [not found] ` <20090530235714.GA4083-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-31  0:01   ` [RFC][v3][PATCH 2/7] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
     [not found]     ` <20090531000115.GA4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01  8:17       ` Amerigo Wang
2009-05-31  0:01   ` [RFC][v3][PATCH 3/7] Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu
     [not found]     ` <20090531000144.GB4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01  8:30       ` Amerigo Wang
2009-05-31  0:02   ` [RFC][v3][PATCH 4/7] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
     [not found]     ` <20090531000220.GC4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01  8:34       ` Amerigo Wang
     [not found]         ` <20090601083419.GE4381-+dguKlz9DXUf7BdofF/totBPR1lH4CV8@public.gmane.org>
2009-06-01 20:52           ` Sukadev Bhattiprolu
     [not found]             ` <20090601205233.GB1812-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01 21:01               ` Sukadev Bhattiprolu
2009-05-31  0:02   ` [RFC][v3][PATCH 5/7] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
     [not found]     ` <20090531000237.GD4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01  8:40       ` Amerigo Wang
2009-05-31  0:02   ` [RFC][v3][PATCH 6/7] Define do_fork_with_pids() Sukadev Bhattiprolu
     [not found]     ` <20090531000255.GE4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01  8:19       ` Amerigo Wang
     [not found]         ` <20090601081929.GC4381-+dguKlz9DXUf7BdofF/totBPR1lH4CV8@public.gmane.org>
2009-06-03  0:35           ` Sukadev Bhattiprolu
     [not found]             ` <20090603003522.GA22704-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-04  3:07               ` Oren Laadan
     [not found]                 ` <Pine.LNX.4.64.0906032306240.25421-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
2009-06-04  7:21                   ` Amerigo Wang
     [not found]                     ` <20090604072112.GA6856-+dguKlz9DXUf7BdofF/totBPR1lH4CV8@public.gmane.org>
2009-06-04 15:41                       ` Sukadev Bhattiprolu
2009-05-31  0:03   ` [RFC][v3][PATCH 7/7] Define clone_with_pids syscall Sukadev Bhattiprolu
     [not found]     ` <20090531000350.GF4191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-31 17:59       ` Oren Laadan [this message]
2009-06-01 15:16       ` Serge E. Hallyn
     [not found]         ` <20090601151650.GA20295-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01 16:30           ` Oren Laadan
     [not found]             ` <4A240213.70001-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-06-01 16:42               ` Serge E. Hallyn
     [not found]                 ` <20090601164232.GA23252-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01 16:54                   ` Oren Laadan
     [not found]                     ` <4A2407B2.8030304-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-06-01 17:19                       ` Serge E. Hallyn
     [not found]                         ` <20090601171943.GA23878-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-01 19:35                           ` Oren Laadan
     [not found]                             ` <4A242D66.2070907-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-06-01 19:54                               ` Serge E. Hallyn
2009-06-01 16:58               ` Oren Laadan
2009-06-13 18:18       ` Sukadev Bhattiprolu
     [not found]         ` <20090613181838.GA2775-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-06-13 20:22           ` Oren Laadan
2009-06-01  8:16   ` [RFC][v3][PATCH 1/7] Factor out code to allocate pidmap page Amerigo Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A22C597.8060207@cs.columbia.edu \
    --to=orenl-eqauephvms7envbuuze7ea@public.gmane.org \
    --cc=containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org \
    --cc=haveblue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org \
    --cc=sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox