Linux Container Development
 help / color / mirror / Atom feed
From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
To: Sukadev Bhattiprolu
	<sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Cc: Containers
	<containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>,
	"David C. Hansen"
	<haveblue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Subject: Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall
Date: Fri, 29 May 2009 01:54:55 -0400	[thread overview]
Message-ID: <4A1F78AF.6030404@cs.columbia.edu> (raw)
In-Reply-To: <20090529054645.GA3344-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>



Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
> | 
> | 
> | Sukadev Bhattiprolu wrote:
> | > Here is an updated patch with hopefully some useful comments on why
> | > we copy to the end of target_pids[] (comments were harder to write
> | > than the code :-)
> | > 
> | > ---
> | > From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> | > Date: Mon, 4 May 2009 01:17:45 -0700
> | > Subject: [PATCH 7/7] Define clone_with_pids syscall
> | > 
> | > Container restart requires that a task have the same pid it had when it was
> | > checkpointed. When containers are nested the tasks within the containers
> | > exist in multiple pid namespaces and hence have multiple pids to specify
> | > during restart.
> | > 
> | > clone_with_pids(), intended for use during restart, is the same as clone(),
> | > except that it takes a 'target_pid_set' paramter. This parameter lets caller
> | > choose specific pid numbers for the child process, in the process's active
> | > and descendant pid namespaces. Since the application does/can not know of any
> | > ancestor pid namespaces, it cannot choose a pid in those namespaces.
> 
> Grr, I got this backwards. The application chooses pid numbers in active
> and ancestor namespaces. It can/does NOT need a pid in descendant
> namespaces. I had it wrong in the code too and fixed it but missed the
> patch description.
> 
> 
> | 
> | I think the term "descendant pid namespaces" is confusing, because it
> | can be interpreted as the collection of all descendant namespaces (e.g.
> | by all children of a task), which is a tree. And I'm unsure what does
> | "ancestor pid namespaces" mean.
> | 
> | I'd say that both "descendant" and "ancestor" here are defined with
> | respect to the nesting level of root task of a restart. Or otherwise
> | say explicitly that it's relative to the nesting level at which the
> | restart operation occurs.
> | 
> | In my mind, clone_with_pid() is performed from "within" the deepest
> | level - in which the child will "live" - and the pids array then
> | works "bottom-up" in the sense that it indicates the desired pids
> | as you go up the ancestry chain (going up is always well defined).
> | 
> | Right now the restart with a flat pid-ns works by first devising a
> | schedule and then following that schedule with forks/clones to
> | generate a process tree.
> | 
> | To also restore pids, we need only use clone_with_pid() with an array
> | of size 1, and we're good.
> | 
> | To restart nested pid-ns from userspace, we need to devise a schedule
> | that will command a sequence of fork/clones. The schedule will tell
> | when to use the CLONE_NEWPID to create a sub-pid-ns. When that happens,
> | we'll increment the size of the pids array - for that clone and all
> | subsequent clones by that task and its descendents.
> | 
> | So the clone_with_pids occurs in the context of the deepest pid-ns, so
> | to speak, and the arrays of pids works its way "upwards"
> | 
> | (That's probably what you meant, anyway).
> | 
> | > 
> | > Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to
> | > prevent unprivileged processes from misusing this interface.
> | > 
> | > Call clone_with_pids as follows:
> | > 
> | > 	pid_t pids[] = { 0, 77, 99 };
> | > 	struct target_pid_set pid_set;
> | > 
> | > 	pid_set.num_pids = sizeof(pids) / sizeof(int);
> | > 	pid_set.target_pids = &pids;
> | > 
> | > 	syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set);
> | > 
> | > If a target-pid is 0, the kernel continues to assign a pid for the process in
> | > that namespace. In the above example, pids[0] is 0, meaning the kernel will
> | > assign next available pid to the process in init_pid_ns. But kernel will assign
> | > pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either
> | > 77 or 99 are taken, the system call fails with -EBUSY.
> | > 
> | > If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces,
> | > the system call fails with -EINVAL.
> | > 
> | > Its mostly an exploratory patch seeking feedback on the interface.
> | > 
> | > NOTE:
> | > 	Compared to clone(), clone_with_pids() needs to pass in two more
> | > 	pieces of information:
> | > 
> | > 		- number of pids in the set
> | > 		- user buffer containing the list of pids.
> | > 
> | > 	But since clone() already takes 5 parameters, use a 'struct
> | > 	target_pid_set'.
> | > 
> | > TODO:
> | > 	- Gently tested.
> | > 	- May need additional sanity checks in do_fork_with_pids().
> | > 	- Allow CLONE_NEWPID() with clone_with_pids() (ensure target-pid in
> | > 	  the namespace is either 1 or 0).
> | > 
> | > Changelog[v2]:
> | > 	- (Oren Laadan) Specified target pids should apply to youngest
> | > 	  pid-namespaces only (see comments in copy_target_pids())
> | > 	- Remove unnecessary printk and add a note to callers of
> | > 	  copy_target_pids() to free target_pids.
> | > 	- (Matt Helsley) Update patch description.
> | > 	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
> | > 	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
> | > 	  'num_pids == 0' (fall back to normal clone()).
> | > 	- Move arch-independent code (sanity checks and copy-in of target-pids)
> | > 	  into kernel/fork.c and simplify sys_clone_with_pids()
> | > 
> | > Changelog[v1]:
> | > 	- Fixed some compile errors (had fixed these errors earlier in my
> | > 	  git tree but had not refreshed patches before emailing them)
> | > 
> | > Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
> | > Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> | > ---
> | 
> | [...]
> | 
> | 
> | > diff --git a/kernel/fork.c b/kernel/fork.c
> | > index a16ef7b..06b1583 100644
> | > --- a/kernel/fork.c
> | > +++ b/kernel/fork.c
> | > @@ -1335,6 +1335,84 @@ struct task_struct * __cpuinit fork_idle(int cpu)
> | >  }
> | >  
> | >  /*
> | > + * If user specified any 'target-pids' in @upid_setp, copy them from
> | > + * user and return a pointer to a local copy of the list of pids. The
> | > + * caller must free the list, when they are done using it.
> | > + *
> | > + * If user did not specify any target pids, return NULL (caller should
> | > + * treat this like normal clone).
> | > + *
> | > + * On any errors, return the error code
> | > + */
> | > +static pid_t *copy_target_pids(void __user *upid_setp)
> | > +{
> | > +	int j;
> | > +	int rc;
> | > +	int size;
> | > +	int num_pids;
> | > +	int nesting;
> | > +	pid_t __user *utarget_pids;
> | > +	pid_t *target_pids;
> | > +	struct target_pid_set pid_set;
> | > +
> | > +	if(!upid_setp)
> | > +		return NULL;
> | > +
> | > +	if (copy_from_user(&pid_set, upid_setp, sizeof(pid_set)))
> | > +		return ERR_PTR(-EFAULT);
> | > +
> | > +	num_pids = pid_set.num_pids;
> | > +	utarget_pids = pid_set.target_pids;
> | > +	nesting = task_pid(current)->level + 1;
> | 
> | I should have mentioned earlier, but there is also the case of
> | CLONE_NEWPID. If CLONE_NEWPID is given, then @nesting should be
> | plus one, _and_ the corresponding pid must be 1 or 0.
> 
> Right. do_fork_with_pids() checks if CLONE_NEWPID is specified with
> target_pids and returns -EINVAL for now.

Any reason not to handle this case already ?

I have a simpler suggestion than above: pass the clone_flags to
copy_target_pids(), and in there, if CLONE_NEWPID is set, then
you should allocate an array +1 in size, and force last slot to
be 0 (or 1). User doesn't have to pass a larger array.

Oren.


> | 
> | And this consideration deserves fat comment :)
> | 
> | > +
> | > +	if (!num_pids)
> | > +		return NULL;
> | > +
> | > +	if (num_pids < 0 || num_pids > nesting)
> | > +		return ERR_PTR(-EINVAL);
> | > +
> | > +	target_pids = kzalloc((nesting * sizeof(pid_t)), GFP_KERNEL);
> | > +	if (!target_pids)
> | > +		return ERR_PTR(-ENOMEM);
> | > +
> | > +	/*
> | > +	 * A process running in a level-1 pid namespace has two pid
> | > +	 * namespaces and hence two pid numbers. If this process is
> | > +	 * checkpointed, information about these two namespaces are
> | > +	 * saved. We refer to these namespaces as 'known namespaces'.
> | > +	 *
> | > +	 * If this checkpointed process is however restarted in a 
> | > +	 * level-2 pid namespace, the restarted process has an extra
> | > +	 * ancestor pid namespace (i.e 'unknown namespace').
> | > +	 *
> | > +	 * During restart, the process requests specific pids for its
> | > +	 * 'known namespaces' and lets kernel assign pids to its 'unknown
> | > +	 * namespaces'. 
> | > +	 *
> | > +	 * Since the requested-pids correspond to 'known namespaces' and
> | > +	 * since 'known-namespaces' are younger than (i.e descendants of)
> | > +	 * 'unknown-namespaces', copy requested pids to end of target_pids[]
> | > +	 * and copy zeroes to the beginning (so kernel can assign a pid for
> | > +	 * the unknown namespaces).
> | > +	 *
> | > +	 * NOTE: The order of pids in target_pids[] is oldest pid namespace
> | > +	 * 	 to youngest (i.e target_pids[0] corresponds to init_pid_ns). 
> | > +	 */
> | 
> | Reads good!
> | 
> | [...]
> | 
> | Oren.
> 

  parent reply	other threads:[~2009-05-29  5:54 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-28  4:37 [RFC][PATCH 1/7][v2] Factor out code to allocate pidmap page Sukadev Bhattiprolu
     [not found] ` <20090528043748.GA16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28  4:38   ` [RFC][PATCH 2/7][v2] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
2009-05-28  4:38   ` [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu
     [not found]     ` <20090528043834.GC16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 12:52       ` Serge E. Hallyn
2009-05-28 14:47       ` Oren Laadan
2009-05-28  4:38   ` [RFC][PATCH 4/7][v2] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
2009-05-28  4:39   ` [RFC][PATCH 5/7][v2] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
2009-05-28  4:39   ` [RFC][PATCH 6/7][v2] Define do_fork_with_pids() Sukadev Bhattiprolu
     [not found]     ` <20090528043929.GF16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 12:54       ` Serge E. Hallyn
2009-05-28 15:03       ` Oren Laadan
2009-05-28  4:39   ` [RFC][PATCH 7/7][v2] Define clone_with_pids syscall Sukadev Bhattiprolu
     [not found]     ` <20090528043945.GG16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 12:02       ` Matt Helsley
2009-05-28 15:01       ` Oren Laadan
     [not found]         ` <4A1EA73F.1080802-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-28 15:14           ` Serge E. Hallyn
     [not found]             ` <20090528151444.GA17772-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 17:01               ` Sukadev Bhattiprolu
     [not found]                 ` <20090528170103.GA26183-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 17:47                   ` Serge E. Hallyn
     [not found]                     ` <20090528174708.GA2236-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
2009-05-28 18:00                       ` Sukadev Bhattiprolu
     [not found]                         ` <20090528180057.GA27191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 23:45                           ` Oren Laadan
2009-05-28 17:30           ` Sukadev Bhattiprolu
     [not found]             ` <20090528173019.GB26183-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 23:47               ` Oren Laadan
     [not found]                 ` <4A1F228C.2020201-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-29  1:16                   ` Sukadev Bhattiprolu
2009-05-29  3:05                   ` Sukadev Bhattiprolu
     [not found]                     ` <20090529030558.GA2548-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-29  5:29                       ` Oren Laadan
     [not found]                         ` <20090529054645.GA3344@us.ibm.com>
     [not found]                           ` <20090529054645.GA3344-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-29  5:54                             ` Oren Laadan [this message]
     [not found]                               ` <4A1F78AF.6030404-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-29 17:06                                 ` Sukadev Bhattiprolu
     [not found]                                   ` <20090529170616.GA12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-29 19:34                                     ` Sukadev Bhattiprolu
     [not found]                                       ` <20090529193416.GB12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-29 20:01                                         ` Oren Laadan
     [not found]                                           ` <4A203F2E.1060807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-29 21:19                                             ` Sukadev Bhattiprolu
     [not found]                                               ` <20090529211922.GC12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-29 21:32                                                 ` Oren Laadan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A1F78AF.6030404@cs.columbia.edu \
    --to=orenl-eqauephvms7envbuuze7ea@public.gmane.org \
    --cc=containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org \
    --cc=haveblue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org \
    --cc=sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox