* [RFC][PATCH 1/7][v2] Factor out code to allocate pidmap page
@ 2009-05-28 4:37 Sukadev Bhattiprolu
[not found] ` <20090528043748.GA16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 29+ messages in thread
From: Sukadev Bhattiprolu @ 2009-05-28 4:37 UTC (permalink / raw)
To: Oren Laadan; +Cc: Containers, David C. Hansen
From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Date: Mon, 4 May 2009 01:17:39 -0700
Subject: [RFC][PATCH 1/7][v2] Factor out code to allocate pidmap page
To implement support for clone_with_pids() system call we would
need to allocate pidmap page in more than one place. Move this
code to a new function alloc_pidmap_page().
Changelog[v2]:
- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
-ENOMEM on error instead of -1.
Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
---
kernel/pid.c | 46 ++++++++++++++++++++++++++++++----------------
1 files changed, 30 insertions(+), 16 deletions(-)
diff --git a/kernel/pid.c b/kernel/pid.c
index b2e5f78..9ff33cc 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,9 +122,34 @@ static void free_pidmap(struct upid *upid)
atomic_inc(&map->nr_free);
}
+static int alloc_pidmap_page(struct pidmap *map)
+{
+ void *page;
+
+ if (likely(map->page))
+ return 0;
+
+ page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+
+ /*
+ * Free the page if someone raced with us installing it:
+ */
+ spin_lock_irq(&pidmap_lock);
+ if (map->page)
+ kfree(page);
+ else
+ map->page = page;
+ spin_unlock_irq(&pidmap_lock);
+
+ if (unlikely(!map->page))
+ return -ENOMEM;
+
+ return 0;
+}
+
static int alloc_pidmap(struct pid_namespace *pid_ns)
{
- int i, offset, max_scan, pid, last = pid_ns->last_pid;
+ int i, rc, offset, max_scan, pid, last = pid_ns->last_pid;
struct pidmap *map;
pid = last + 1;
@@ -134,21 +159,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
for (i = 0; i <= max_scan; ++i) {
- if (unlikely(!map->page)) {
- void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
- /*
- * Free the page if someone raced with us
- * installing it:
- */
- spin_lock_irq(&pidmap_lock);
- if (map->page)
- kfree(page);
- else
- map->page = page;
- spin_unlock_irq(&pidmap_lock);
- if (unlikely(!map->page))
- break;
- }
+ rc = alloc_pidmap_page(map);
+ if (rc)
+ break;
+
if (likely(atomic_read(&map->nr_free))) {
do {
if (!test_and_set_bit(offset, map->page)) {
--
1.5.2.5
^ permalink raw reply related [flat|nested] 29+ messages in thread[parent not found: <20090528043748.GA16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* [RFC][PATCH 2/7][v2] Have alloc_pidmap() return actual error code [not found] ` <20090528043748.GA16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-28 4:38 ` Sukadev Bhattiprolu 2009-05-28 4:38 ` [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu ` (4 subsequent siblings) 5 siblings, 0 replies; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-28 4:38 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, David C. Hansen From 991fb474b055d36c4516cf7f79a247b7d79819ae Mon Sep 17 00:00:00 2001 From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Date: Mon, 4 May 2009 01:17:40 -0700 Subject: [RFC][PATCH 2/7][v2] Have alloc_pidmap() return actual error code alloc_pidmap() can fail either because all pid numbers are in use or because memory allocation failed. With support for setting a specific pid number, alloc_pidmap() would also fail if either the given pid number is invalid or in use. Rather than have callers assume -ENOMEM, have alloc_pidmap() return the actual error. Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- kernel/fork.c | 5 +++-- kernel/pid.c | 9 ++++++--- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index b9e2edd..f8411a8 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1119,10 +1119,11 @@ static struct task_struct *copy_process(unsigned long clone_flags, goto bad_fork_cleanup_io; if (pid != &init_struct_pid) { - retval = -ENOMEM; pid = alloc_pid(p->nsproxy->pid_ns); - if (!pid) + if (IS_ERR(pid)) { + retval = PTR_ERR(pid); goto bad_fork_cleanup_io; + } if (clone_flags & CLONE_NEWPID) { retval = pid_ns_prepare_proc(p->nsproxy->pid_ns); diff --git a/kernel/pid.c b/kernel/pid.c index 9ff33cc..b2d6a19 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -158,6 +158,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) offset = pid & BITS_PER_PAGE_MASK; map = &pid_ns->pidmap[pid/BITS_PER_PAGE]; max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset; + rc = -EAGAIN; for (i = 0; i <= max_scan; ++i) { rc = alloc_pidmap_page(map); if (rc) @@ -188,12 +189,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) } else { map = &pid_ns->pidmap[0]; offset = RESERVED_PIDS; - if (unlikely(last == offset)) + if (unlikely(last == offset)) { + rc = -EAGAIN; break; + } } pid = mk_pid(pid_ns, map, offset); } - return -1; + return rc; } int next_pidmap(struct pid_namespace *pid_ns, int last) @@ -298,7 +301,7 @@ out_free: free_pidmap(pid->numbers + i); kmem_cache_free(ns->pid_cachep, pid); - pid = NULL; + pid = ERR_PTR(nr); goto out; } -- 1.5.2.5 ^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap() [not found] ` <20090528043748.GA16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2009-05-28 4:38 ` [RFC][PATCH 2/7][v2] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu @ 2009-05-28 4:38 ` Sukadev Bhattiprolu [not found] ` <20090528043834.GC16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2009-05-28 4:38 ` [RFC][PATCH 4/7][v2] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu ` (3 subsequent siblings) 5 siblings, 1 reply; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-28 4:38 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, David C. Hansen From a1fdec1036a952359d02a7c667d126bd2fff6804 Mon Sep 17 00:00:00 2001 From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Date: Mon, 4 May 2009 01:17:41 -0700 Subject: [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap() With support for setting a specific pid number for a process, alloc_pidmap() will need a paramter a 'target_pid' parameter. Changelog[v2]: - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code actually checks for 'pid <= 0' for completeness). Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> --- kernel/pid.c | 28 ++++++++++++++++++++++++++-- 1 files changed, 26 insertions(+), 2 deletions(-) diff --git a/kernel/pid.c b/kernel/pid.c index b2d6a19..b44dd21 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -147,11 +147,35 @@ static int alloc_pidmap_page(struct pidmap *map) return 0; } -static int alloc_pidmap(struct pid_namespace *pid_ns) +static int set_pidmap(struct pid_namespace *pid_ns, int pid) +{ + int offset; + struct pidmap *map; + + if (pid <= 0 || pid >= pid_max) + return -EINVAL; + + offset = pid & BITS_PER_PAGE_MASK; + map = &pid_ns->pidmap[pid/BITS_PER_PAGE]; + + if (alloc_pidmap_page(map)) + return -ENOMEM; + + if (test_and_set_bit(offset, map->page)) + return -EBUSY; + + atomic_dec(&map->nr_free); + return pid; +} + +static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid) { int i, rc, offset, max_scan, pid, last = pid_ns->last_pid; struct pidmap *map; + if (target_pid) + return set_pidmap(pid_ns, target_pid); + pid = last + 1; if (pid >= pid_max) pid = RESERVED_PIDS; @@ -270,7 +294,7 @@ struct pid *alloc_pid(struct pid_namespace *ns) tmp = ns; for (i = ns->level; i >= 0; i--) { - nr = alloc_pidmap(tmp); + nr = alloc_pidmap(tmp, 0); if (nr < 0) goto out_free; -- 1.5.2.5 ^ permalink raw reply related [flat|nested] 29+ messages in thread
[parent not found: <20090528043834.GC16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap() [not found] ` <20090528043834.GC16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-28 12:52 ` Serge E. Hallyn 2009-05-28 14:47 ` Oren Laadan 1 sibling, 0 replies; 29+ messages in thread From: Serge E. Hallyn @ 2009-05-28 12:52 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org): > > >From a1fdec1036a952359d02a7c667d126bd2fff6804 Mon Sep 17 00:00:00 2001 > From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > Date: Mon, 4 May 2009 01:17:41 -0700 > Subject: [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap() > > With support for setting a specific pid number for a process, > alloc_pidmap() will need a paramter a 'target_pid' parameter. > > Changelog[v2]: > - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code > actually checks for 'pid <= 0' for completeness). > > Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> > --- > kernel/pid.c | 28 ++++++++++++++++++++++++++-- > 1 files changed, 26 insertions(+), 2 deletions(-) > > diff --git a/kernel/pid.c b/kernel/pid.c > index b2d6a19..b44dd21 100644 > --- a/kernel/pid.c > +++ b/kernel/pid.c > @@ -147,11 +147,35 @@ static int alloc_pidmap_page(struct pidmap *map) > return 0; > } > > -static int alloc_pidmap(struct pid_namespace *pid_ns) > +static int set_pidmap(struct pid_namespace *pid_ns, int pid) > +{ > + int offset; > + struct pidmap *map; > + > + if (pid <= 0 || pid >= pid_max) > + return -EINVAL; > + > + offset = pid & BITS_PER_PAGE_MASK; > + map = &pid_ns->pidmap[pid/BITS_PER_PAGE]; > + > + if (alloc_pidmap_page(map)) > + return -ENOMEM; > + > + if (test_and_set_bit(offset, map->page)) > + return -EBUSY; > + > + atomic_dec(&map->nr_free); > + return pid; > +} > + > +static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid) > { > int i, rc, offset, max_scan, pid, last = pid_ns->last_pid; > struct pidmap *map; > > + if (target_pid) > + return set_pidmap(pid_ns, target_pid); > + > pid = last + 1; > if (pid >= pid_max) > pid = RESERVED_PIDS; > @@ -270,7 +294,7 @@ struct pid *alloc_pid(struct pid_namespace *ns) > > tmp = ns; > for (i = ns->level; i >= 0; i--) { > - nr = alloc_pidmap(tmp); > + nr = alloc_pidmap(tmp, 0); > if (nr < 0) > goto out_free; > > -- > 1.5.2.5 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap() [not found] ` <20090528043834.GC16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2009-05-28 12:52 ` Serge E. Hallyn @ 2009-05-28 14:47 ` Oren Laadan 1 sibling, 0 replies; 29+ messages in thread From: Oren Laadan @ 2009-05-28 14:47 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen Sukadev Bhattiprolu wrote: > From a1fdec1036a952359d02a7c667d126bd2fff6804 Mon Sep 17 00:00:00 2001 > From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > Date: Mon, 4 May 2009 01:17:41 -0700 > Subject: [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap() > > With support for setting a specific pid number for a process, > alloc_pidmap() will need a paramter a 'target_pid' parameter. > > Changelog[v2]: > - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code > actually checks for 'pid <= 0' for completeness). > > Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > --- Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> > kernel/pid.c | 28 ++++++++++++++++++++++++++-- > 1 files changed, 26 insertions(+), 2 deletions(-) > > diff --git a/kernel/pid.c b/kernel/pid.c > index b2d6a19..b44dd21 100644 > --- a/kernel/pid.c > +++ b/kernel/pid.c > @@ -147,11 +147,35 @@ static int alloc_pidmap_page(struct pidmap *map) > return 0; > } > > -static int alloc_pidmap(struct pid_namespace *pid_ns) > +static int set_pidmap(struct pid_namespace *pid_ns, int pid) > +{ > + int offset; > + struct pidmap *map; > + > + if (pid <= 0 || pid >= pid_max) > + return -EINVAL; > + > + offset = pid & BITS_PER_PAGE_MASK; > + map = &pid_ns->pidmap[pid/BITS_PER_PAGE]; > + > + if (alloc_pidmap_page(map)) > + return -ENOMEM; > + > + if (test_and_set_bit(offset, map->page)) > + return -EBUSY; > + > + atomic_dec(&map->nr_free); > + return pid; > +} > + > +static int alloc_pidmap(struct pid_namespace *pid_ns, int target_pid) > { > int i, rc, offset, max_scan, pid, last = pid_ns->last_pid; > struct pidmap *map; > > + if (target_pid) > + return set_pidmap(pid_ns, target_pid); > + > pid = last + 1; > if (pid >= pid_max) > pid = RESERVED_PIDS; > @@ -270,7 +294,7 @@ struct pid *alloc_pid(struct pid_namespace *ns) > > tmp = ns; > for (i = ns->level; i >= 0; i--) { > - nr = alloc_pidmap(tmp); > + nr = alloc_pidmap(tmp, 0); > if (nr < 0) > goto out_free; > ^ permalink raw reply [flat|nested] 29+ messages in thread
* [RFC][PATCH 4/7][v2] Add target_pids parameter to alloc_pid() [not found] ` <20090528043748.GA16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2009-05-28 4:38 ` [RFC][PATCH 2/7][v2] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu 2009-05-28 4:38 ` [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu @ 2009-05-28 4:38 ` Sukadev Bhattiprolu 2009-05-28 4:39 ` [RFC][PATCH 5/7][v2] Add target_pids parameter to copy_process() Sukadev Bhattiprolu ` (2 subsequent siblings) 5 siblings, 0 replies; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-28 4:38 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, David C. Hansen From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Date: Mon, 4 May 2009 01:17:42 -0700 Subject: [RFC][PATCH 4/7][v2] Add target_pids parameter to alloc_pid() With support for setting a specific pid numbers, alloc_pid() would need to take a set of 'target-pids' which gives the user-specified pids. Add this parameter to alloc_pid(), but leave it set to NULL for now. The parameter will be used in a follow-on patch. Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- include/linux/pid.h | 2 +- kernel/fork.c | 3 ++- kernel/pid.c | 9 +++++++-- 3 files changed, 10 insertions(+), 4 deletions(-) diff --git a/include/linux/pid.h b/include/linux/pid.h index 49f1c2f..914185d 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr); extern struct pid *find_ge_pid(int nr, struct pid_namespace *); int next_pidmap(struct pid_namespace *pid_ns, int last); -extern struct pid *alloc_pid(struct pid_namespace *ns); +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids); extern void free_pid(struct pid *pid); /* diff --git a/kernel/fork.c b/kernel/fork.c index f8411a8..d2d69d3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -954,6 +954,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, int retval; struct task_struct *p; int cgroup_callbacks_done = 0; + pid_t *target_pids = NULL; if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) return ERR_PTR(-EINVAL); @@ -1119,7 +1120,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, goto bad_fork_cleanup_io; if (pid != &init_struct_pid) { - pid = alloc_pid(p->nsproxy->pid_ns); + pid = alloc_pid(p->nsproxy->pid_ns, target_pids); if (IS_ERR(pid)) { retval = PTR_ERR(pid); goto bad_fork_cleanup_io; diff --git a/kernel/pid.c b/kernel/pid.c index b44dd21..090b221 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -280,13 +280,14 @@ void free_pid(struct pid *pid) call_rcu(&pid->rcu, delayed_put_pid); } -struct pid *alloc_pid(struct pid_namespace *ns) +struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids) { struct pid *pid; enum pid_type type; int i, nr; struct pid_namespace *tmp; struct upid *upid; + int tpid; pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL); if (!pid) @@ -294,7 +295,11 @@ struct pid *alloc_pid(struct pid_namespace *ns) tmp = ns; for (i = ns->level; i >= 0; i--) { - nr = alloc_pidmap(tmp, 0); + tpid = 0; + if (target_pids) + tpid = target_pids[i]; + + nr = alloc_pidmap(tmp, tpid); if (nr < 0) goto out_free; -- 1.5.2.5 ^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC][PATCH 5/7][v2] Add target_pids parameter to copy_process() [not found] ` <20090528043748.GA16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> ` (2 preceding siblings ...) 2009-05-28 4:38 ` [RFC][PATCH 4/7][v2] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu @ 2009-05-28 4:39 ` Sukadev Bhattiprolu 2009-05-28 4:39 ` [RFC][PATCH 6/7][v2] Define do_fork_with_pids() Sukadev Bhattiprolu 2009-05-28 4:39 ` [RFC][PATCH 7/7][v2] Define clone_with_pids syscall Sukadev Bhattiprolu 5 siblings, 0 replies; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-28 4:39 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, David C. Hansen From 432bc68b622661cd4a28379e98e3ecc8f44d915d Mon Sep 17 00:00:00 2001 From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Date: Mon, 4 May 2009 01:17:43 -0700 Subject: [RFC][PATCH 5/7][v2] Add target_pids parameter to copy_process() Add a 'target_pids' parameter to copy_process(). The new parameter will be used in a follow-on patch when clone_with_pids() is implemented. Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- kernel/fork.c | 7 ++++--- 1 files changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index d2d69d3..373411e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -949,12 +949,12 @@ static struct task_struct *copy_process(unsigned long clone_flags, unsigned long stack_size, int __user *child_tidptr, struct pid *pid, + pid_t *target_pids, int trace) { int retval; struct task_struct *p; int cgroup_callbacks_done = 0; - pid_t *target_pids = NULL; if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) return ERR_PTR(-EINVAL); @@ -1327,7 +1327,7 @@ struct task_struct * __cpuinit fork_idle(int cpu) struct pt_regs regs; task = copy_process(CLONE_VM, 0, idle_regs(®s), 0, NULL, - &init_struct_pid, 0); + &init_struct_pid, NULL, 0); if (!IS_ERR(task)) init_idle(task, cpu); @@ -1350,6 +1350,7 @@ long do_fork(unsigned long clone_flags, struct task_struct *p; int trace = 0; long nr; + pid_t *target_pids = NULL; /* * Do some preliminary argument and permissions checking before we @@ -1390,7 +1391,7 @@ long do_fork(unsigned long clone_flags, trace = tracehook_prepare_clone(clone_flags); p = copy_process(clone_flags, stack_start, regs, stack_size, - child_tidptr, NULL, trace); + child_tidptr, NULL, target_pids, trace); /* * Do this prior waking up the new thread - the thread pointer * might get invalid after that point, if the thread exits quickly. -- 1.5.2.5 ^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC][PATCH 6/7][v2] Define do_fork_with_pids() [not found] ` <20090528043748.GA16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> ` (3 preceding siblings ...) 2009-05-28 4:39 ` [RFC][PATCH 5/7][v2] Add target_pids parameter to copy_process() Sukadev Bhattiprolu @ 2009-05-28 4:39 ` Sukadev Bhattiprolu [not found] ` <20090528043929.GF16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2009-05-28 4:39 ` [RFC][PATCH 7/7][v2] Define clone_with_pids syscall Sukadev Bhattiprolu 5 siblings, 1 reply; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-28 4:39 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, David C. Hansen From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Date: Mon, 4 May 2009 01:17:44 -0700 Subject: [RFC][PATCH 6/7][v2] Define do_fork_with_pids() do_fork_with_pids() is same as do_fork(), except that it takes an additional, 'pid_set', parameter. This parameter, currently unused, specifies the set of target pids of the process in each of its pid namespaces. Changelog[v2]: - [v1] of this patch had some architecture-indpendent code in arch/x86/kernel/process_32.c. To facilitate moving this code to kernel/fork.c, in the next patch, [v2] of the patch passes 'struct target_pid_set __user *' to do_fork_with_pids() instead of 'pid_t *'. Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> --- include/linux/sched.h | 1 + include/linux/types.h | 5 +++++ kernel/fork.c | 16 ++++++++++++++-- 3 files changed, 20 insertions(+), 2 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index b4c38bc..8468e54 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1995,6 +1995,7 @@ extern int disallow_signal(int); extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *); extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *); +extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, struct target_pid_set __user *pid_set); struct task_struct *fork_idle(int); extern void set_task_comm(struct task_struct *tsk, char *from); diff --git a/include/linux/types.h b/include/linux/types.h index 5abe354..17ec186 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -204,6 +204,11 @@ struct ustat { char f_fpack[6]; }; +struct target_pid_set { + int num_pids; + pid_t *target_pids; +}; + #endif /* __KERNEL__ */ #endif /* __ASSEMBLY__ */ #endif /* _LINUX_TYPES_H */ diff --git a/kernel/fork.c b/kernel/fork.c index 373411e..a16ef7b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1340,12 +1340,13 @@ struct task_struct * __cpuinit fork_idle(int cpu) * It copies the process, and if successful kick-starts * it and waits for it to finish using the VM if required. */ -long do_fork(unsigned long clone_flags, +long do_fork_with_pids(unsigned long clone_flags, unsigned long stack_start, struct pt_regs *regs, unsigned long stack_size, int __user *parent_tidptr, - int __user *child_tidptr) + int __user *child_tidptr, + struct target_pid_set __user *pid_setp) { struct task_struct *p; int trace = 0; @@ -1448,6 +1449,17 @@ long do_fork(unsigned long clone_flags, return nr; } +long do_fork(unsigned long clone_flags, + unsigned long stack_start, + struct pt_regs *regs, + unsigned long stack_size, + int __user *parent_tidptr, + int __user *child_tidptr) +{ + return do_fork_with_pids(clone_flags, stack_start, regs, stack_size, + parent_tidptr, child_tidptr, NULL); +} + #ifndef ARCH_MIN_MMSTRUCT_ALIGN #define ARCH_MIN_MMSTRUCT_ALIGN 0 #endif -- 1.5.2.5 ^ permalink raw reply related [flat|nested] 29+ messages in thread
[parent not found: <20090528043929.GF16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 6/7][v2] Define do_fork_with_pids() [not found] ` <20090528043929.GF16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-28 12:54 ` Serge E. Hallyn 2009-05-28 15:03 ` Oren Laadan 1 sibling, 0 replies; 29+ messages in thread From: Serge E. Hallyn @ 2009-05-28 12:54 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org): > > From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > Date: Mon, 4 May 2009 01:17:44 -0700 > Subject: [RFC][PATCH 6/7][v2] Define do_fork_with_pids() > > do_fork_with_pids() is same as do_fork(), except that it takes an > additional, 'pid_set', parameter. This parameter, currently unused, > specifies the set of target pids of the process in each of its pid > namespaces. > > Changelog[v2]: > - [v1] of this patch had some architecture-indpendent code in > arch/x86/kernel/process_32.c. To facilitate moving this code > to kernel/fork.c, in the next patch, [v2] of the patch passes > 'struct target_pid_set __user *' to do_fork_with_pids() instead > of 'pid_t *'. > > Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> > --- > include/linux/sched.h | 1 + > include/linux/types.h | 5 +++++ > kernel/fork.c | 16 ++++++++++++++-- > 3 files changed, 20 insertions(+), 2 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index b4c38bc..8468e54 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1995,6 +1995,7 @@ extern int disallow_signal(int); > > extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *); > extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *); > +extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, struct target_pid_set __user *pid_set); > struct task_struct *fork_idle(int); > > extern void set_task_comm(struct task_struct *tsk, char *from); > diff --git a/include/linux/types.h b/include/linux/types.h > index 5abe354..17ec186 100644 > --- a/include/linux/types.h > +++ b/include/linux/types.h > @@ -204,6 +204,11 @@ struct ustat { > char f_fpack[6]; > }; > > +struct target_pid_set { > + int num_pids; > + pid_t *target_pids; > +}; > + > #endif /* __KERNEL__ */ > #endif /* __ASSEMBLY__ */ > #endif /* _LINUX_TYPES_H */ > diff --git a/kernel/fork.c b/kernel/fork.c > index 373411e..a16ef7b 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1340,12 +1340,13 @@ struct task_struct * __cpuinit fork_idle(int cpu) > * It copies the process, and if successful kick-starts > * it and waits for it to finish using the VM if required. > */ > -long do_fork(unsigned long clone_flags, > +long do_fork_with_pids(unsigned long clone_flags, > unsigned long stack_start, > struct pt_regs *regs, > unsigned long stack_size, > int __user *parent_tidptr, > - int __user *child_tidptr) > + int __user *child_tidptr, > + struct target_pid_set __user *pid_setp) > { > struct task_struct *p; > int trace = 0; > @@ -1448,6 +1449,17 @@ long do_fork(unsigned long clone_flags, > return nr; > } > > +long do_fork(unsigned long clone_flags, > + unsigned long stack_start, > + struct pt_regs *regs, > + unsigned long stack_size, > + int __user *parent_tidptr, > + int __user *child_tidptr) > +{ > + return do_fork_with_pids(clone_flags, stack_start, regs, stack_size, > + parent_tidptr, child_tidptr, NULL); > +} > + > #ifndef ARCH_MIN_MMSTRUCT_ALIGN > #define ARCH_MIN_MMSTRUCT_ALIGN 0 > #endif > -- > 1.5.2.5 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC][PATCH 6/7][v2] Define do_fork_with_pids() [not found] ` <20090528043929.GF16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2009-05-28 12:54 ` Serge E. Hallyn @ 2009-05-28 15:03 ` Oren Laadan 1 sibling, 0 replies; 29+ messages in thread From: Oren Laadan @ 2009-05-28 15:03 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen Sukadev Bhattiprolu wrote: > From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > Date: Mon, 4 May 2009 01:17:44 -0700 > Subject: [RFC][PATCH 6/7][v2] Define do_fork_with_pids() > > do_fork_with_pids() is same as do_fork(), except that it takes an > additional, 'pid_set', parameter. This parameter, currently unused, > specifies the set of target pids of the process in each of its pid > namespaces. > > Changelog[v2]: > - [v1] of this patch had some architecture-indpendent code in > arch/x86/kernel/process_32.c. To facilitate moving this code > to kernel/fork.c, in the next patch, [v2] of the patch passes > 'struct target_pid_set __user *' to do_fork_with_pids() instead > of 'pid_t *'. > > Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > --- Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> > include/linux/sched.h | 1 + > include/linux/types.h | 5 +++++ > kernel/fork.c | 16 ++++++++++++++-- > 3 files changed, 20 insertions(+), 2 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index b4c38bc..8468e54 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1995,6 +1995,7 @@ extern int disallow_signal(int); > > extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *); > extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *); > +extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, struct target_pid_set __user *pid_set); > struct task_struct *fork_idle(int); > > extern void set_task_comm(struct task_struct *tsk, char *from); > diff --git a/include/linux/types.h b/include/linux/types.h > index 5abe354..17ec186 100644 > --- a/include/linux/types.h > +++ b/include/linux/types.h > @@ -204,6 +204,11 @@ struct ustat { > char f_fpack[6]; > }; > > +struct target_pid_set { > + int num_pids; > + pid_t *target_pids; > +}; > + > #endif /* __KERNEL__ */ > #endif /* __ASSEMBLY__ */ > #endif /* _LINUX_TYPES_H */ > diff --git a/kernel/fork.c b/kernel/fork.c > index 373411e..a16ef7b 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1340,12 +1340,13 @@ struct task_struct * __cpuinit fork_idle(int cpu) > * It copies the process, and if successful kick-starts > * it and waits for it to finish using the VM if required. > */ > -long do_fork(unsigned long clone_flags, > +long do_fork_with_pids(unsigned long clone_flags, > unsigned long stack_start, > struct pt_regs *regs, > unsigned long stack_size, > int __user *parent_tidptr, > - int __user *child_tidptr) > + int __user *child_tidptr, > + struct target_pid_set __user *pid_setp) > { > struct task_struct *p; > int trace = 0; > @@ -1448,6 +1449,17 @@ long do_fork(unsigned long clone_flags, > return nr; > } > > +long do_fork(unsigned long clone_flags, > + unsigned long stack_start, > + struct pt_regs *regs, > + unsigned long stack_size, > + int __user *parent_tidptr, > + int __user *child_tidptr) > +{ > + return do_fork_with_pids(clone_flags, stack_start, regs, stack_size, > + parent_tidptr, child_tidptr, NULL); > +} > + > #ifndef ARCH_MIN_MMSTRUCT_ALIGN > #define ARCH_MIN_MMSTRUCT_ALIGN 0 > #endif ^ permalink raw reply [flat|nested] 29+ messages in thread
* [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090528043748.GA16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> ` (4 preceding siblings ...) 2009-05-28 4:39 ` [RFC][PATCH 6/7][v2] Define do_fork_with_pids() Sukadev Bhattiprolu @ 2009-05-28 4:39 ` Sukadev Bhattiprolu [not found] ` <20090528043945.GG16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 5 siblings, 1 reply; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-28 4:39 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, David C. Hansen From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Date: Mon, 4 May 2009 01:17:45 -0700 Subject: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall clone_with_pids() is same as clone(), except that it takes a 'target_pid_set' paramter which lets caller choose a specific pid number for the child process in each of the child process's pid namespace. This system call would be needed to implement Checkpoint/Restart (i.e after a checkpoint, restart a process with its original pids). Call clone_with_pids as follows: pid_t pids[] = { 0, 77, 99 }; struct target_pid_set pid_set; pid_set.num_pids = sizeof(pids) / sizeof(int); pid_set.target_pids = &pids; syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set); If a target-pid is 0, the kernel continues to assign a pid for the process in that namespace. In the above example, pids[0] is 0, meaning the kernel will assign next available pid to the process in init_pid_ns. But kernel will assign pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either 77 or 99 are taken, the system call fails with -EBUSY. If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces, the system call fails with -EINVAL. Its mostly an exploratory patch seeking feedback on the interface. NOTE: 1. clone_with_pids(), at least for now, needs CAP_SYS_ADMIN to prevent misuse of the interface. 2. Compared to clone(), clone_with_pids() needs to pass in two more pieces of information: - number of pids in the set - user buffer containing the list of pids. But since clone() already takes 5 parameters, use a 'struct target_pid_set'. TODO: - Gently tested. - May need additional sanity checks in do_fork_with_pids(). - Allow CLONE_NEWPID() with clone_with_pids() (ensure target-pid in the namespace is either 1 or 0). Changelog[v2]: - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description. - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and 'num_pids == 0' (fall back to normal clone()). - Move arch-independent code (sanity checks and copy-in of target-pids) into kernel/fork.c and simplify sys_clone_with_pids() Changelog[v1]: - Fixed some compile errors (had fixed these errors earlier in my git tree but had not refreshed patches before emailing them) Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> --- arch/x86/include/asm/syscalls.h | 1 + arch/x86/include/asm/unistd_32.h | 1 + arch/x86/kernel/entry_32.S | 1 + arch/x86/kernel/process_32.c | 21 +++++++++ arch/x86/kernel/syscall_table_32.S | 1 + kernel/fork.c | 81 +++++++++++++++++++++++++++++++++++- 6 files changed, 105 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h index 7043408..1fdc149 100644 --- a/arch/x86/include/asm/syscalls.h +++ b/arch/x86/include/asm/syscalls.h @@ -31,6 +31,7 @@ asmlinkage int sys_get_thread_area(struct user_desc __user *); /* kernel/process_32.c */ int sys_fork(struct pt_regs *); int sys_clone(struct pt_regs *); +int sys_clone_with_pids(struct pt_regs *); int sys_vfork(struct pt_regs *); int sys_execve(struct pt_regs *); diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h index 6e72d74..90f906f 100644 --- a/arch/x86/include/asm/unistd_32.h +++ b/arch/x86/include/asm/unistd_32.h @@ -340,6 +340,7 @@ #define __NR_inotify_init1 332 #define __NR_preadv 333 #define __NR_pwritev 334 +#define __NR_clone_with_pids 335 #ifdef __KERNEL__ diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index c929add..ee92b0d 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -707,6 +707,7 @@ ptregs_##name: \ PTREGSCALL(iopl) PTREGSCALL(fork) PTREGSCALL(clone) +PTREGSCALL(clone_with_pids) PTREGSCALL(vfork) PTREGSCALL(execve) PTREGSCALL(sigaltstack) diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 76f8f84..1efc3de 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -445,6 +445,27 @@ int sys_clone(struct pt_regs *regs) return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr); } +int sys_clone_with_pids(struct pt_regs *regs) +{ + unsigned long clone_flags; + unsigned long newsp; + int __user *parent_tidptr; + int __user *child_tidptr; + void __user *upid_setp; + + clone_flags = regs->bx; + newsp = regs->cx; + parent_tidptr = (int __user *)regs->dx; + child_tidptr = (int __user *)regs->di; + upid_setp = (void __user *)regs->bp; + + if (!newsp) + newsp = regs->sp; + + return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr, + child_tidptr, upid_setp); +} + /* * sys_execve() executes a new program. */ diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index ff5c873..94c1a58 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -334,3 +334,4 @@ ENTRY(sys_call_table) .long sys_inotify_init1 .long sys_preadv .long sys_pwritev + .long ptregs_clone_with_pids /* 335 */ diff --git a/kernel/fork.c b/kernel/fork.c index a16ef7b..f265a18 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1335,6 +1335,58 @@ struct task_struct * __cpuinit fork_idle(int cpu) } /* + * If user specified any 'target-pids' in @upid_setp, copy them from + * user and return a pointer to the list of target pids. + * + * If user did not specify any target pids, return NULL (caller should + * treat this like normal clone). + * + * On any errors, return the error code + */ +static pid_t *copy_target_pids(void __user *upid_setp) +{ + int rc; + int size; + int num_pids; + pid_t __user *utarget_pids; + pid_t *target_pids; + struct target_pid_set pid_set; + + if(!upid_setp) + return NULL; + + if (copy_from_user(&pid_set, upid_setp, sizeof(pid_set))) + return ERR_PTR(-EFAULT); + + num_pids = pid_set.num_pids; + utarget_pids = pid_set.target_pids; + size = num_pids * sizeof(pid_t); + + if (!num_pids) + return NULL; + + if (num_pids < 0 || num_pids > task_pid(current)->level + 1) + return ERR_PTR(-EINVAL); + + target_pids = kzalloc(size, GFP_KERNEL); + if (!target_pids) + return ERR_PTR(-ENOMEM); + + rc = -EFAULT; + if (copy_from_user(target_pids, pid_set.target_pids, size)) + goto out_free; + + printk(KERN_ERR "clone_with_pids() num_pids %d, [ %d, %d ]\n", num_pids, + target_pids[0], target_pids[1]); + + return target_pids; + +out_free: + kfree(target_pids); + return ERR_PTR(rc); +} + +/* * Ok, this is the main fork-routine. * * It copies the process, and if successful kick-starts @@ -1351,7 +1403,7 @@ long do_fork_with_pids(unsigned long clone_flags, struct task_struct *p; int trace = 0; long nr; - pid_t *target_pids = NULL; + pid_t *target_pids; /* * Do some preliminary argument and permissions checking before we @@ -1385,6 +1437,29 @@ long do_fork_with_pids(unsigned long clone_flags, } } + target_pids = copy_target_pids(pid_setp); + + if (target_pids) { + if (IS_ERR(target_pids)) + return PTR_ERR(target_pids); + + nr = -EPERM; + if (!capable(CAP_SYS_ADMIN)) + goto out_free; + + /* + * CLONE_NEWPID implies pid == 1 + * + * TODO: Should this be more fine-grained ? (i.e would we want + * to have a container-init have a specific pid in an + * ancestor namespace ?) Maybe needed to checkpoint/ + * restart an application that has a nested container. + */ + nr = -EINVAL; + if (clone_flags & CLONE_NEWPID) + goto out_free; + } + /* * When called from kernel_thread, don't do user tracing stuff. */ @@ -1446,6 +1521,10 @@ long do_fork_with_pids(unsigned long clone_flags, } else { nr = PTR_ERR(p); } + +out_free: + kfree(target_pids); + return nr; } -- 1.5.2.5 ^ permalink raw reply related [flat|nested] 29+ messages in thread
[parent not found: <20090528043945.GG16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090528043945.GG16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-28 12:02 ` Matt Helsley 2009-05-28 15:01 ` Oren Laadan 1 sibling, 0 replies; 29+ messages in thread From: Matt Helsley @ 2009-05-28 12:02 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen On Wed, May 27, 2009 at 09:39:45PM -0700, Sukadev Bhattiprolu wrote: > > From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > Date: Mon, 4 May 2009 01:17:45 -0700 > Subject: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall > > clone_with_pids() is same as clone(), except that it takes a 'target_pid_set' > paramter which lets caller choose a specific pid number for the child process > in each of the child process's pid namespace. This system call would be needed > to implement Checkpoint/Restart (i.e after a checkpoint, restart a process with > its original pids). > > Call clone_with_pids as follows: > > pid_t pids[] = { 0, 77, 99 }; > struct target_pid_set pid_set; > > pid_set.num_pids = sizeof(pids) / sizeof(int); > pid_set.target_pids = &pids; > > syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set); > > If a target-pid is 0, the kernel continues to assign a pid for the process in > that namespace. In the above example, pids[0] is 0, meaning the kernel will > assign next available pid to the process in init_pid_ns. But kernel will assign > pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either > 77 or 99 are taken, the system call fails with -EBUSY. > > If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces, > the system call fails with -EINVAL. > The patch description shows the solution before alluding to the problem. You could prepend a more elaborate problem description: Container restart requires that a task have the same pid it had when it was checkpointed. When containers are nested the tasks within the containers exist in multiple pid namespaces and hence have multiple pids to specify during restart. clone_with_pids(), intended for use during restart, is the same as clone(), ... I think that makes it clearer why you're introducing clone_with_pids() and why clone() or clone_with_pid() would be insufficient. Cheers, -Matt ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090528043945.GG16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2009-05-28 12:02 ` Matt Helsley @ 2009-05-28 15:01 ` Oren Laadan [not found] ` <4A1EA73F.1080802-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 1 sibling, 1 reply; 29+ messages in thread From: Oren Laadan @ 2009-05-28 15:01 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen Sukadev Bhattiprolu wrote: > From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > Date: Mon, 4 May 2009 01:17:45 -0700 > Subject: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall > > clone_with_pids() is same as clone(), except that it takes a 'target_pid_set' > paramter which lets caller choose a specific pid number for the child process > in each of the child process's pid namespace. This system call would be needed > to implement Checkpoint/Restart (i.e after a checkpoint, restart a process with > its original pids). > > Call clone_with_pids as follows: > > pid_t pids[] = { 0, 77, 99 }; > struct target_pid_set pid_set; > > pid_set.num_pids = sizeof(pids) / sizeof(int); > pid_set.target_pids = &pids; > > syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set); > > If a target-pid is 0, the kernel continues to assign a pid for the process in > that namespace. In the above example, pids[0] is 0, meaning the kernel will > assign next available pid to the process in init_pid_ns. But kernel will assign > pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either > 77 or 99 are taken, the system call fails with -EBUSY. > > If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces, > the system call fails with -EINVAL. > > Its mostly an exploratory patch seeking feedback on the interface. > > NOTE: > 1. clone_with_pids(), at least for now, needs CAP_SYS_ADMIN to prevent > misuse of the interface. > > 2. Compared to clone(), clone_with_pids() needs to pass in two more > pieces of information: > > - number of pids in the set > - user buffer containing the list of pids. > > But since clone() already takes 5 parameters, use a 'struct > target_pid_set'. > > TODO: > - Gently tested. > - May need additional sanity checks in do_fork_with_pids(). > - Allow CLONE_NEWPID() with clone_with_pids() (ensure target-pid in > the namespace is either 1 or 0). > > Changelog[v2]: > - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description. > - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and > 'num_pids == 0' (fall back to normal clone()). > - Move arch-independent code (sanity checks and copy-in of target-pids) > into kernel/fork.c and simplify sys_clone_with_pids() > > Changelog[v1]: > - Fixed some compile errors (had fixed these errors earlier in my > git tree but had not refreshed patches before emailing them) > > Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> [...] > index a16ef7b..f265a18 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1335,6 +1335,58 @@ struct task_struct * __cpuinit fork_idle(int cpu) > } > > /* > + * If user specified any 'target-pids' in @upid_setp, copy them from > + * user and return a pointer to the list of target pids. > + * > + * If user did not specify any target pids, return NULL (caller should > + * treat this like normal clone). > + * > + * On any errors, return the error code > + */ > +static pid_t *copy_target_pids(void __user *upid_setp) > +{ > + int rc; > + int size; > + int num_pids; > + pid_t __user *utarget_pids; > + pid_t *target_pids; > + struct target_pid_set pid_set; > + > + if(!upid_setp) > + return NULL; > + > + if (copy_from_user(&pid_set, upid_setp, sizeof(pid_set))) > + return ERR_PTR(-EFAULT); > + > + num_pids = pid_set.num_pids; > + utarget_pids = pid_set.target_pids; > + size = num_pids * sizeof(pid_t); > + > + if (!num_pids) > + return NULL; > + > + if (num_pids < 0 || num_pids > task_pid(current)->level + 1) > + return ERR_PTR(-EINVAL); What happens if (num_pids < task_pid(current->level) + 1) ? Unless I missed something elsewhere, I suspect it will oops, because in patch #4, you had: for (i = ns->level; i >= 0; i--) { - nr = alloc_pidmap(tmp, 0); + tpid = 0; + if (target_pids) + tpid = target_pids[i]; + + nr = alloc_pidmap(tmp, tpid); if (nr < 0) goto out_free; In general, can a task figure out it's depth in the pid-ns hierarchy ? I'm thinking of a case in which a checkpoint was taken of a (flat) container which is in depth 2 of the global hierarchy, and then it is restarted in depth 3, or in depth 1. Perhaps the semantics of the syscall should be that target_pids will indicate the desired pids _from the bottom up_. A value of "0" means "don't care" and let the kernel assign. The remaining levels (upwards) will be treated as zeros. [...] Oren. ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <4A1EA73F.1080802-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <4A1EA73F.1080802-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2009-05-28 15:14 ` Serge E. Hallyn [not found] ` <20090528151444.GA17772-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2009-05-28 17:30 ` Sukadev Bhattiprolu 1 sibling, 1 reply; 29+ messages in thread From: Serge E. Hallyn @ 2009-05-28 15:14 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, Sukadev Bhattiprolu, David C. Hansen Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org): > In general, can a task figure out it's depth in the pid-ns hierarchy ? In user-space? No. That's why the task has to be allowed to specify only 2 pids, and have that mean 'auto-select all other pids'. -serge ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <20090528151444.GA17772-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090528151444.GA17772-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-28 17:01 ` Sukadev Bhattiprolu [not found] ` <20090528170103.GA26183-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-28 17:01 UTC (permalink / raw) To: Serge E. Hallyn; +Cc: Containers, David C. Hansen Serge E. Hallyn [serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote: | Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org): | > In general, can a task figure out it's depth in the pid-ns hierarchy ? | | In user-space? No. Not from any existing user interface. But hypothetically or in the long-term can/should sys_checkpoint() not be able to figure out if it has to C/R say a bash shell that has nested pid namespaces ? I am guessing sys_checkpoint() can know if its crossing a pid-ns boundary by comparing or computing the max nesting level while walking a process tree. i.e if task_pid(task)->pid_ns->level is not the same for all process in the tree, then we have nested namespaces. How to checkpoint/restart such a tree is of course a bigger challenge. But if sys_checkpoint() can find the max nesting level ? If that info is saved in checkpoint image, clone_with_pids() could use that info - no ? | | That's why the task has to be allowed to specify only 2 pids, and | have that mean 'auto-select all other pids'. | | -serge ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <20090528170103.GA26183-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090528170103.GA26183-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-28 17:47 ` Serge E. Hallyn [not found] ` <20090528174708.GA2236-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 29+ messages in thread From: Serge E. Hallyn @ 2009-05-28 17:47 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org): > Serge E. Hallyn [serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote: > | Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org): > | > In general, can a task figure out it's depth in the pid-ns hierarchy ? > | > | In user-space? No. > > Not from any existing user interface. But hypothetically or in the > long-term can/should sys_checkpoint() not be able to figure out if > it has to C/R say a bash shell that has nested pid namespaces ? > > I am guessing sys_checkpoint() can know if its crossing a pid-ns > boundary by comparing or computing the max nesting level while > walking a process tree. i.e if task_pid(task)->pid_ns->level is > not the same for all process in the tree, then we have nested > namespaces. How to checkpoint/restart such a tree is of course a > bigger challenge. > > But if sys_checkpoint() can find the max nesting level ? If that > info is saved in checkpoint image, clone_with_pids() could > use that info - no ? If a container at pidns level 2, with 1 more pidns underneath it, is checkpointed, then the checkpoint image should only reflect the base and child pidns. The two ancestor pidns levels should not be there. -serge ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <20090528174708.GA2236-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090528174708.GA2236-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> @ 2009-05-28 18:00 ` Sukadev Bhattiprolu [not found] ` <20090528180057.GA27191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-28 18:00 UTC (permalink / raw) To: Serge E. Hallyn; +Cc: Containers, David C. Hansen Serge E. Hallyn [serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org] wrote: | | If a container at pidns level 2, with 1 more pidns underneath it, | is checkpointed, then the checkpoint image should only reflect | the base and child pidns. The two ancestor pidns levels should | not be there. Agree. I meant to say that we would need to specify more than 2 pids when checkpoint/restarting apps with deeply nested namespaces if any :-) Specifying bottom-up may solve the problem I guess. ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <20090528180057.GA27191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090528180057.GA27191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-28 23:45 ` Oren Laadan 0 siblings, 0 replies; 29+ messages in thread From: Oren Laadan @ 2009-05-28 23:45 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen Sukadev Bhattiprolu wrote: > Serge E. Hallyn [serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org] wrote: > | > | If a container at pidns level 2, with 1 more pidns underneath it, > | is checkpointed, then the checkpoint image should only reflect > | the base and child pidns. The two ancestor pidns levels should > | not be there. > > Agree. I meant to say that we would need to specify more than 2 > pids when checkpoint/restarting apps with deeply nested namespaces > if any :-) Just like the checkpoint image should only reflect the nesting from some base down, clone_with_pid() should reflect nesting from some base down. (Hmm.. ignore this if it is what you meant already ...) > > Specifying bottom-up may solve the problem I guess. Perhaps bottom-up is a misleading: I don't care about the order... I meant that a N size array should affect the last N levels of the hierarchy (if it is deeper than N). Oren. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <4A1EA73F.1080802-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 2009-05-28 15:14 ` Serge E. Hallyn @ 2009-05-28 17:30 ` Sukadev Bhattiprolu [not found] ` <20090528173019.GB26183-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-28 17:30 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, David C. Hansen Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote: | > + | > + if (num_pids < 0 || num_pids > task_pid(current)->level + 1) | > + return ERR_PTR(-EINVAL); | | What happens if (num_pids < task_pid(current->level) + 1) ? | Unless I missed something elsewhere, I suspect it will oops, Yep. the kzalloc() below this check should use: task_pid(current)->pid_ns->level+1 * sizeof(pid_t) | because in patch #4, you had: | | for (i = ns->level; i >= 0; i--) { | - nr = alloc_pidmap(tmp, 0); | + tpid = 0; | + if (target_pids) | + tpid = target_pids[i]; | + | + nr = alloc_pidmap(tmp, tpid); | if (nr < 0) | goto out_free; | | In general, can a task figure out it's depth in the pid-ns hierarchy ? | | I'm thinking of a case in which a checkpoint was taken of a (flat) | container which is in depth 2 of the global hierarchy, and then it | is restarted in depth 3, or in depth 1. | | Perhaps the semantics of the syscall should be that target_pids will | indicate the desired pids _from the bottom up_. A value of "0" means Hmm, I thought bottom up would be confusing, but I guess that would work better :-) Will fix and resend. | "don't care" and let the kernel assign. The remaining levels (upwards) | will be treated as zeros. | | [...] | | Oren. | ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <20090528173019.GB26183-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090528173019.GB26183-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-28 23:47 ` Oren Laadan [not found] ` <4A1F228C.2020201-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 0 siblings, 1 reply; 29+ messages in thread From: Oren Laadan @ 2009-05-28 23:47 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen Sukadev Bhattiprolu wrote: > Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote: > > | > + > | > + if (num_pids < 0 || num_pids > task_pid(current)->level + 1) > | > + return ERR_PTR(-EINVAL); > | > | What happens if (num_pids < task_pid(current->level) + 1) ? > | Unless I missed something elsewhere, I suspect it will oops, > > Yep. the kzalloc() below this check should use: > > task_pid(current)->pid_ns->level+1 * sizeof(pid_t) > > | because in patch #4, you had: > | > | for (i = ns->level; i >= 0; i--) { > | - nr = alloc_pidmap(tmp, 0); > | + tpid = 0; > | + if (target_pids) > | + tpid = target_pids[i]; > | + > | + nr = alloc_pidmap(tmp, tpid); > | if (nr < 0) > | goto out_free; > | > | In general, can a task figure out it's depth in the pid-ns hierarchy ? > | > | I'm thinking of a case in which a checkpoint was taken of a (flat) > | container which is in depth 2 of the global hierarchy, and then it > | is restarted in depth 3, or in depth 1. > | > | Perhaps the semantics of the syscall should be that target_pids will > | indicate the desired pids _from the bottom up_. A value of "0" means > > Hmm, I thought bottom up would be confusing, but I guess that would > work better :-) Will fix and resend. (Repeating...) Perhaps bottom-up is a misleading: I don't care about the order... I meant that a N size array should affect the last N levels of the hierarchy (if it is deeper than N). You could actually keep everything else as is, and change the code that copies from user to always allocate a ns->level+1 size array, and copy the user data to the end of it, zeroing the beginning. Oren. > > | "don't care" and let the kernel assign. The remaining levels (upwards) > | will be treated as zeros. > | > | [...] > | > | Oren. > | > ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <4A1F228C.2020201-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <4A1F228C.2020201-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2009-05-29 1:16 ` Sukadev Bhattiprolu 2009-05-29 3:05 ` Sukadev Bhattiprolu 1 sibling, 0 replies; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-29 1:16 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, David C. Hansen Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote: | | | Sukadev Bhattiprolu wrote: | > Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote: | > | > | > + | > | > + if (num_pids < 0 || num_pids > task_pid(current)->level + 1) | > | > + return ERR_PTR(-EINVAL); | > | | > | What happens if (num_pids < task_pid(current->level) + 1) ? | > | Unless I missed something elsewhere, I suspect it will oops, | > | > Yep. the kzalloc() below this check should use: | > | > task_pid(current)->pid_ns->level+1 * sizeof(pid_t) | > | > | because in patch #4, you had: | > | | > | for (i = ns->level; i >= 0; i--) { | > | - nr = alloc_pidmap(tmp, 0); | > | + tpid = 0; | > | + if (target_pids) | > | + tpid = target_pids[i]; | > | + | > | + nr = alloc_pidmap(tmp, tpid); | > | if (nr < 0) | > | goto out_free; | > | | > | In general, can a task figure out it's depth in the pid-ns hierarchy ? | > | | > | I'm thinking of a case in which a checkpoint was taken of a (flat) | > | container which is in depth 2 of the global hierarchy, and then it | > | is restarted in depth 3, or in depth 1. | > | | > | Perhaps the semantics of the syscall should be that target_pids will | > | indicate the desired pids _from the bottom up_. A value of "0" means | > | > Hmm, I thought bottom up would be confusing, but I guess that would | > work better :-) Will fix and resend. | | (Repeating...) | | Perhaps bottom-up is a misleading: I don't care about the order... | I meant that a N size array should affect the last N levels of | the hierarchy (if it is deeper than N). Agree. | | You could actually keep everything else as is, and change the code | that copies from user to always allocate a ns->level+1 size array, | and copy the user data to the end of it, zeroing the beginning. Yep, good idea. Sukadev ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <4A1F228C.2020201-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 2009-05-29 1:16 ` Sukadev Bhattiprolu @ 2009-05-29 3:05 ` Sukadev Bhattiprolu [not found] ` <20090529030558.GA2548-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-29 3:05 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, David C. Hansen Here is an updated patch with hopefully some useful comments on why we copy to the end of target_pids[] (comments were harder to write than the code :-) --- From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Date: Mon, 4 May 2009 01:17:45 -0700 Subject: [PATCH 7/7] Define clone_with_pids syscall Container restart requires that a task have the same pid it had when it was checkpointed. When containers are nested the tasks within the containers exist in multiple pid namespaces and hence have multiple pids to specify during restart. clone_with_pids(), intended for use during restart, is the same as clone(), except that it takes a 'target_pid_set' paramter. This parameter lets caller choose specific pid numbers for the child process, in the process's active and descendant pid namespaces. Since the application does/can not know of any ancestor pid namespaces, it cannot choose a pid in those namespaces. Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to prevent unprivileged processes from misusing this interface. Call clone_with_pids as follows: pid_t pids[] = { 0, 77, 99 }; struct target_pid_set pid_set; pid_set.num_pids = sizeof(pids) / sizeof(int); pid_set.target_pids = &pids; syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set); If a target-pid is 0, the kernel continues to assign a pid for the process in that namespace. In the above example, pids[0] is 0, meaning the kernel will assign next available pid to the process in init_pid_ns. But kernel will assign pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either 77 or 99 are taken, the system call fails with -EBUSY. If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces, the system call fails with -EINVAL. Its mostly an exploratory patch seeking feedback on the interface. NOTE: Compared to clone(), clone_with_pids() needs to pass in two more pieces of information: - number of pids in the set - user buffer containing the list of pids. But since clone() already takes 5 parameters, use a 'struct target_pid_set'. TODO: - Gently tested. - May need additional sanity checks in do_fork_with_pids(). - Allow CLONE_NEWPID() with clone_with_pids() (ensure target-pid in the namespace is either 1 or 0). Changelog[v2]: - (Oren Laadan) Specified target pids should apply to youngest pid-namespaces only (see comments in copy_target_pids()) - Remove unnecessary printk and add a note to callers of copy_target_pids() to free target_pids. - (Matt Helsley) Update patch description. - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description. - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and 'num_pids == 0' (fall back to normal clone()). - Move arch-independent code (sanity checks and copy-in of target-pids) into kernel/fork.c and simplify sys_clone_with_pids() Changelog[v1]: - Fixed some compile errors (had fixed these errors earlier in my git tree but had not refreshed patches before emailing them) Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- arch/x86/include/asm/syscalls.h | 1 + arch/x86/include/asm/unistd_32.h | 1 + arch/x86/kernel/entry_32.S | 1 + arch/x86/kernel/process_32.c | 21 +++++++ arch/x86/kernel/syscall_table_32.S | 1 + kernel/fork.c | 107 +++++++++++++++++++++++++++++++++++- 6 files changed, 131 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h index 7043408..1fdc149 100644 --- a/arch/x86/include/asm/syscalls.h +++ b/arch/x86/include/asm/syscalls.h @@ -31,6 +31,7 @@ asmlinkage int sys_get_thread_area(struct user_desc __user *); /* kernel/process_32.c */ int sys_fork(struct pt_regs *); int sys_clone(struct pt_regs *); +int sys_clone_with_pids(struct pt_regs *); int sys_vfork(struct pt_regs *); int sys_execve(struct pt_regs *); diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h index 6e72d74..90f906f 100644 --- a/arch/x86/include/asm/unistd_32.h +++ b/arch/x86/include/asm/unistd_32.h @@ -340,6 +340,7 @@ #define __NR_inotify_init1 332 #define __NR_preadv 333 #define __NR_pwritev 334 +#define __NR_clone_with_pids 335 #ifdef __KERNEL__ diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index c929add..ee92b0d 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -707,6 +707,7 @@ ptregs_##name: \ PTREGSCALL(iopl) PTREGSCALL(fork) PTREGSCALL(clone) +PTREGSCALL(clone_with_pids) PTREGSCALL(vfork) PTREGSCALL(execve) PTREGSCALL(sigaltstack) diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 76f8f84..1efc3de 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -445,6 +445,27 @@ int sys_clone(struct pt_regs *regs) return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr); } +int sys_clone_with_pids(struct pt_regs *regs) +{ + unsigned long clone_flags; + unsigned long newsp; + int __user *parent_tidptr; + int __user *child_tidptr; + void __user *upid_setp; + + clone_flags = regs->bx; + newsp = regs->cx; + parent_tidptr = (int __user *)regs->dx; + child_tidptr = (int __user *)regs->di; + upid_setp = (void __user *)regs->bp; + + if (!newsp) + newsp = regs->sp; + + return do_fork_with_pids(clone_flags, newsp, regs, 0, parent_tidptr, + child_tidptr, upid_setp); +} + /* * sys_execve() executes a new program. */ diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index ff5c873..94c1a58 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -334,3 +334,4 @@ ENTRY(sys_call_table) .long sys_inotify_init1 .long sys_preadv .long sys_pwritev + .long ptregs_clone_with_pids /* 335 */ diff --git a/kernel/fork.c b/kernel/fork.c index a16ef7b..06b1583 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1335,6 +1335,84 @@ struct task_struct * __cpuinit fork_idle(int cpu) } /* + * If user specified any 'target-pids' in @upid_setp, copy them from + * user and return a pointer to a local copy of the list of pids. The + * caller must free the list, when they are done using it. + * + * If user did not specify any target pids, return NULL (caller should + * treat this like normal clone). + * + * On any errors, return the error code + */ +static pid_t *copy_target_pids(void __user *upid_setp) +{ + int j; + int rc; + int size; + int num_pids; + int nesting; + pid_t __user *utarget_pids; + pid_t *target_pids; + struct target_pid_set pid_set; + + if(!upid_setp) + return NULL; + + if (copy_from_user(&pid_set, upid_setp, sizeof(pid_set))) + return ERR_PTR(-EFAULT); + + num_pids = pid_set.num_pids; + utarget_pids = pid_set.target_pids; + nesting = task_pid(current)->level + 1; + + if (!num_pids) + return NULL; + + if (num_pids < 0 || num_pids > nesting) + return ERR_PTR(-EINVAL); + + target_pids = kzalloc((nesting * sizeof(pid_t)), GFP_KERNEL); + if (!target_pids) + return ERR_PTR(-ENOMEM); + + /* + * A process running in a level-1 pid namespace has two pid + * namespaces and hence two pid numbers. If this process is + * checkpointed, information about these two namespaces are + * saved. We refer to these namespaces as 'known namespaces'. + * + * If this checkpointed process is however restarted in a + * level-2 pid namespace, the restarted process has an extra + * ancestor pid namespace (i.e 'unknown namespace'). + * + * During restart, the process requests specific pids for its + * 'known namespaces' and lets kernel assign pids to its 'unknown + * namespaces'. + * + * Since the requested-pids correspond to 'known namespaces' and + * since 'known-namespaces' are younger than (i.e descendants of) + * 'unknown-namespaces', copy requested pids to end of target_pids[] + * and copy zeroes to the beginning (so kernel can assign a pid for + * the unknown namespaces). + * + * NOTE: The order of pids in target_pids[] is oldest pid namespace + * to youngest (i.e target_pids[0] corresponds to init_pid_ns). + */ + j = nesting - num_pids; + size = num_pids * sizeof(pid_t); + + rc = -EFAULT; + if (copy_from_user(&target_pids[j], pid_set.target_pids, size)) + goto out_free; + + return target_pids; + +out_free: + kfree(target_pids); + return ERR_PTR(rc); +} + +/* * Ok, this is the main fork-routine. * * It copies the process, and if successful kick-starts @@ -1351,7 +1429,7 @@ long do_fork_with_pids(unsigned long clone_flags, struct task_struct *p; int trace = 0; long nr; - pid_t *target_pids = NULL; + pid_t *target_pids; /* * Do some preliminary argument and permissions checking before we @@ -1385,6 +1463,29 @@ long do_fork_with_pids(unsigned long clone_flags, } } + target_pids = copy_target_pids(pid_setp); + + if (target_pids) { + if (IS_ERR(target_pids)) + return PTR_ERR(target_pids); + + nr = -EPERM; + if (!capable(CAP_SYS_ADMIN)) + goto out_free; + + /* + * CLONE_NEWPID implies pid == 1 + * + * TODO: Should this be more fine-grained ? (i.e would we want + * to have a container-init have a specific pid in an + * ancestor namespace ?) Maybe needed to checkpoint/ + * restart an application that has a nested container. + */ + nr = -EINVAL; + if (clone_flags & CLONE_NEWPID) + goto out_free; + } + /* * When called from kernel_thread, don't do user tracing stuff. */ @@ -1446,6 +1547,10 @@ long do_fork_with_pids(unsigned long clone_flags, } else { nr = PTR_ERR(p); } + +out_free: + kfree(target_pids); + return nr; } -- 1.5.2.5 ^ permalink raw reply related [flat|nested] 29+ messages in thread
[parent not found: <20090529030558.GA2548-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090529030558.GA2548-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-29 5:29 ` Oren Laadan [not found] ` <20090529054645.GA3344@us.ibm.com> 0 siblings, 1 reply; 29+ messages in thread From: Oren Laadan @ 2009-05-29 5:29 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen Sukadev Bhattiprolu wrote: > Here is an updated patch with hopefully some useful comments on why > we copy to the end of target_pids[] (comments were harder to write > than the code :-) > > --- > From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > Date: Mon, 4 May 2009 01:17:45 -0700 > Subject: [PATCH 7/7] Define clone_with_pids syscall > > Container restart requires that a task have the same pid it had when it was > checkpointed. When containers are nested the tasks within the containers > exist in multiple pid namespaces and hence have multiple pids to specify > during restart. > > clone_with_pids(), intended for use during restart, is the same as clone(), > except that it takes a 'target_pid_set' paramter. This parameter lets caller > choose specific pid numbers for the child process, in the process's active > and descendant pid namespaces. Since the application does/can not know of any > ancestor pid namespaces, it cannot choose a pid in those namespaces. I think the term "descendant pid namespaces" is confusing, because it can be interpreted as the collection of all descendant namespaces (e.g. by all children of a task), which is a tree. And I'm unsure what does "ancestor pid namespaces" mean. I'd say that both "descendant" and "ancestor" here are defined with respect to the nesting level of root task of a restart. Or otherwise say explicitly that it's relative to the nesting level at which the restart operation occurs. In my mind, clone_with_pid() is performed from "within" the deepest level - in which the child will "live" - and the pids array then works "bottom-up" in the sense that it indicates the desired pids as you go up the ancestry chain (going up is always well defined). Right now the restart with a flat pid-ns works by first devising a schedule and then following that schedule with forks/clones to generate a process tree. To also restore pids, we need only use clone_with_pid() with an array of size 1, and we're good. To restart nested pid-ns from userspace, we need to devise a schedule that will command a sequence of fork/clones. The schedule will tell when to use the CLONE_NEWPID to create a sub-pid-ns. When that happens, we'll increment the size of the pids array - for that clone and all subsequent clones by that task and its descendents. So the clone_with_pids occurs in the context of the deepest pid-ns, so to speak, and the arrays of pids works its way "upwards" (That's probably what you meant, anyway). > > Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to > prevent unprivileged processes from misusing this interface. > > Call clone_with_pids as follows: > > pid_t pids[] = { 0, 77, 99 }; > struct target_pid_set pid_set; > > pid_set.num_pids = sizeof(pids) / sizeof(int); > pid_set.target_pids = &pids; > > syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set); > > If a target-pid is 0, the kernel continues to assign a pid for the process in > that namespace. In the above example, pids[0] is 0, meaning the kernel will > assign next available pid to the process in init_pid_ns. But kernel will assign > pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either > 77 or 99 are taken, the system call fails with -EBUSY. > > If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces, > the system call fails with -EINVAL. > > Its mostly an exploratory patch seeking feedback on the interface. > > NOTE: > Compared to clone(), clone_with_pids() needs to pass in two more > pieces of information: > > - number of pids in the set > - user buffer containing the list of pids. > > But since clone() already takes 5 parameters, use a 'struct > target_pid_set'. > > TODO: > - Gently tested. > - May need additional sanity checks in do_fork_with_pids(). > - Allow CLONE_NEWPID() with clone_with_pids() (ensure target-pid in > the namespace is either 1 or 0). > > Changelog[v2]: > - (Oren Laadan) Specified target pids should apply to youngest > pid-namespaces only (see comments in copy_target_pids()) > - Remove unnecessary printk and add a note to callers of > copy_target_pids() to free target_pids. > - (Matt Helsley) Update patch description. > - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description. > - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and > 'num_pids == 0' (fall back to normal clone()). > - Move arch-independent code (sanity checks and copy-in of target-pids) > into kernel/fork.c and simplify sys_clone_with_pids() > > Changelog[v1]: > - Fixed some compile errors (had fixed these errors earlier in my > git tree but had not refreshed patches before emailing them) > > Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> > --- [...] > diff --git a/kernel/fork.c b/kernel/fork.c > index a16ef7b..06b1583 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1335,6 +1335,84 @@ struct task_struct * __cpuinit fork_idle(int cpu) > } > > /* > + * If user specified any 'target-pids' in @upid_setp, copy them from > + * user and return a pointer to a local copy of the list of pids. The > + * caller must free the list, when they are done using it. > + * > + * If user did not specify any target pids, return NULL (caller should > + * treat this like normal clone). > + * > + * On any errors, return the error code > + */ > +static pid_t *copy_target_pids(void __user *upid_setp) > +{ > + int j; > + int rc; > + int size; > + int num_pids; > + int nesting; > + pid_t __user *utarget_pids; > + pid_t *target_pids; > + struct target_pid_set pid_set; > + > + if(!upid_setp) > + return NULL; > + > + if (copy_from_user(&pid_set, upid_setp, sizeof(pid_set))) > + return ERR_PTR(-EFAULT); > + > + num_pids = pid_set.num_pids; > + utarget_pids = pid_set.target_pids; > + nesting = task_pid(current)->level + 1; I should have mentioned earlier, but there is also the case of CLONE_NEWPID. If CLONE_NEWPID is given, then @nesting should be plus one, _and_ the corresponding pid must be 1 or 0. And this consideration deserves fat comment :) > + > + if (!num_pids) > + return NULL; > + > + if (num_pids < 0 || num_pids > nesting) > + return ERR_PTR(-EINVAL); > + > + target_pids = kzalloc((nesting * sizeof(pid_t)), GFP_KERNEL); > + if (!target_pids) > + return ERR_PTR(-ENOMEM); > + > + /* > + * A process running in a level-1 pid namespace has two pid > + * namespaces and hence two pid numbers. If this process is > + * checkpointed, information about these two namespaces are > + * saved. We refer to these namespaces as 'known namespaces'. > + * > + * If this checkpointed process is however restarted in a > + * level-2 pid namespace, the restarted process has an extra > + * ancestor pid namespace (i.e 'unknown namespace'). > + * > + * During restart, the process requests specific pids for its > + * 'known namespaces' and lets kernel assign pids to its 'unknown > + * namespaces'. > + * > + * Since the requested-pids correspond to 'known namespaces' and > + * since 'known-namespaces' are younger than (i.e descendants of) > + * 'unknown-namespaces', copy requested pids to end of target_pids[] > + * and copy zeroes to the beginning (so kernel can assign a pid for > + * the unknown namespaces). > + * > + * NOTE: The order of pids in target_pids[] is oldest pid namespace > + * to youngest (i.e target_pids[0] corresponds to init_pid_ns). > + */ Reads good! [...] Oren. ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <20090529054645.GA3344@us.ibm.com>]
[parent not found: <20090529054645.GA3344-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090529054645.GA3344-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-29 5:54 ` Oren Laadan [not found] ` <4A1F78AF.6030404-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 0 siblings, 1 reply; 29+ messages in thread From: Oren Laadan @ 2009-05-29 5:54 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen Sukadev Bhattiprolu wrote: > Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote: > | > | > | Sukadev Bhattiprolu wrote: > | > Here is an updated patch with hopefully some useful comments on why > | > we copy to the end of target_pids[] (comments were harder to write > | > than the code :-) > | > > | > --- > | > From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > | > Date: Mon, 4 May 2009 01:17:45 -0700 > | > Subject: [PATCH 7/7] Define clone_with_pids syscall > | > > | > Container restart requires that a task have the same pid it had when it was > | > checkpointed. When containers are nested the tasks within the containers > | > exist in multiple pid namespaces and hence have multiple pids to specify > | > during restart. > | > > | > clone_with_pids(), intended for use during restart, is the same as clone(), > | > except that it takes a 'target_pid_set' paramter. This parameter lets caller > | > choose specific pid numbers for the child process, in the process's active > | > and descendant pid namespaces. Since the application does/can not know of any > | > ancestor pid namespaces, it cannot choose a pid in those namespaces. > > Grr, I got this backwards. The application chooses pid numbers in active > and ancestor namespaces. It can/does NOT need a pid in descendant > namespaces. I had it wrong in the code too and fixed it but missed the > patch description. > > > | > | I think the term "descendant pid namespaces" is confusing, because it > | can be interpreted as the collection of all descendant namespaces (e.g. > | by all children of a task), which is a tree. And I'm unsure what does > | "ancestor pid namespaces" mean. > | > | I'd say that both "descendant" and "ancestor" here are defined with > | respect to the nesting level of root task of a restart. Or otherwise > | say explicitly that it's relative to the nesting level at which the > | restart operation occurs. > | > | In my mind, clone_with_pid() is performed from "within" the deepest > | level - in which the child will "live" - and the pids array then > | works "bottom-up" in the sense that it indicates the desired pids > | as you go up the ancestry chain (going up is always well defined). > | > | Right now the restart with a flat pid-ns works by first devising a > | schedule and then following that schedule with forks/clones to > | generate a process tree. > | > | To also restore pids, we need only use clone_with_pid() with an array > | of size 1, and we're good. > | > | To restart nested pid-ns from userspace, we need to devise a schedule > | that will command a sequence of fork/clones. The schedule will tell > | when to use the CLONE_NEWPID to create a sub-pid-ns. When that happens, > | we'll increment the size of the pids array - for that clone and all > | subsequent clones by that task and its descendents. > | > | So the clone_with_pids occurs in the context of the deepest pid-ns, so > | to speak, and the arrays of pids works its way "upwards" > | > | (That's probably what you meant, anyway). > | > | > > | > Unlike clone(), clone_with_pids() needs CAP_SYS_ADMIN, at least for now, to > | > prevent unprivileged processes from misusing this interface. > | > > | > Call clone_with_pids as follows: > | > > | > pid_t pids[] = { 0, 77, 99 }; > | > struct target_pid_set pid_set; > | > > | > pid_set.num_pids = sizeof(pids) / sizeof(int); > | > pid_set.target_pids = &pids; > | > > | > syscall(__NR_clone_with_pids, flags, stack, NULL, NULL, NULL, &pid_set); > | > > | > If a target-pid is 0, the kernel continues to assign a pid for the process in > | > that namespace. In the above example, pids[0] is 0, meaning the kernel will > | > assign next available pid to the process in init_pid_ns. But kernel will assign > | > pid 77 in the child pid namespace 1 and pid 99 in pid namespace 2. If either > | > 77 or 99 are taken, the system call fails with -EBUSY. > | > > | > If 'pid_set.num_pids' exceeds the current nesting level of pid namespaces, > | > the system call fails with -EINVAL. > | > > | > Its mostly an exploratory patch seeking feedback on the interface. > | > > | > NOTE: > | > Compared to clone(), clone_with_pids() needs to pass in two more > | > pieces of information: > | > > | > - number of pids in the set > | > - user buffer containing the list of pids. > | > > | > But since clone() already takes 5 parameters, use a 'struct > | > target_pid_set'. > | > > | > TODO: > | > - Gently tested. > | > - May need additional sanity checks in do_fork_with_pids(). > | > - Allow CLONE_NEWPID() with clone_with_pids() (ensure target-pid in > | > the namespace is either 1 or 0). > | > > | > Changelog[v2]: > | > - (Oren Laadan) Specified target pids should apply to youngest > | > pid-namespaces only (see comments in copy_target_pids()) > | > - Remove unnecessary printk and add a note to callers of > | > copy_target_pids() to free target_pids. > | > - (Matt Helsley) Update patch description. > | > - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description. > | > - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and > | > 'num_pids == 0' (fall back to normal clone()). > | > - Move arch-independent code (sanity checks and copy-in of target-pids) > | > into kernel/fork.c and simplify sys_clone_with_pids() > | > > | > Changelog[v1]: > | > - Fixed some compile errors (had fixed these errors earlier in my > | > git tree but had not refreshed patches before emailing them) > | > > | > Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> > | > Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> > | > --- > | > | [...] > | > | > | > diff --git a/kernel/fork.c b/kernel/fork.c > | > index a16ef7b..06b1583 100644 > | > --- a/kernel/fork.c > | > +++ b/kernel/fork.c > | > @@ -1335,6 +1335,84 @@ struct task_struct * __cpuinit fork_idle(int cpu) > | > } > | > > | > /* > | > + * If user specified any 'target-pids' in @upid_setp, copy them from > | > + * user and return a pointer to a local copy of the list of pids. The > | > + * caller must free the list, when they are done using it. > | > + * > | > + * If user did not specify any target pids, return NULL (caller should > | > + * treat this like normal clone). > | > + * > | > + * On any errors, return the error code > | > + */ > | > +static pid_t *copy_target_pids(void __user *upid_setp) > | > +{ > | > + int j; > | > + int rc; > | > + int size; > | > + int num_pids; > | > + int nesting; > | > + pid_t __user *utarget_pids; > | > + pid_t *target_pids; > | > + struct target_pid_set pid_set; > | > + > | > + if(!upid_setp) > | > + return NULL; > | > + > | > + if (copy_from_user(&pid_set, upid_setp, sizeof(pid_set))) > | > + return ERR_PTR(-EFAULT); > | > + > | > + num_pids = pid_set.num_pids; > | > + utarget_pids = pid_set.target_pids; > | > + nesting = task_pid(current)->level + 1; > | > | I should have mentioned earlier, but there is also the case of > | CLONE_NEWPID. If CLONE_NEWPID is given, then @nesting should be > | plus one, _and_ the corresponding pid must be 1 or 0. > > Right. do_fork_with_pids() checks if CLONE_NEWPID is specified with > target_pids and returns -EINVAL for now. Any reason not to handle this case already ? I have a simpler suggestion than above: pass the clone_flags to copy_target_pids(), and in there, if CLONE_NEWPID is set, then you should allocate an array +1 in size, and force last slot to be 0 (or 1). User doesn't have to pass a larger array. Oren. > | > | And this consideration deserves fat comment :) > | > | > + > | > + if (!num_pids) > | > + return NULL; > | > + > | > + if (num_pids < 0 || num_pids > nesting) > | > + return ERR_PTR(-EINVAL); > | > + > | > + target_pids = kzalloc((nesting * sizeof(pid_t)), GFP_KERNEL); > | > + if (!target_pids) > | > + return ERR_PTR(-ENOMEM); > | > + > | > + /* > | > + * A process running in a level-1 pid namespace has two pid > | > + * namespaces and hence two pid numbers. If this process is > | > + * checkpointed, information about these two namespaces are > | > + * saved. We refer to these namespaces as 'known namespaces'. > | > + * > | > + * If this checkpointed process is however restarted in a > | > + * level-2 pid namespace, the restarted process has an extra > | > + * ancestor pid namespace (i.e 'unknown namespace'). > | > + * > | > + * During restart, the process requests specific pids for its > | > + * 'known namespaces' and lets kernel assign pids to its 'unknown > | > + * namespaces'. > | > + * > | > + * Since the requested-pids correspond to 'known namespaces' and > | > + * since 'known-namespaces' are younger than (i.e descendants of) > | > + * 'unknown-namespaces', copy requested pids to end of target_pids[] > | > + * and copy zeroes to the beginning (so kernel can assign a pid for > | > + * the unknown namespaces). > | > + * > | > + * NOTE: The order of pids in target_pids[] is oldest pid namespace > | > + * to youngest (i.e target_pids[0] corresponds to init_pid_ns). > | > + */ > | > | Reads good! > | > | [...] > | > | Oren. > ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <4A1F78AF.6030404-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <4A1F78AF.6030404-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2009-05-29 17:06 ` Sukadev Bhattiprolu [not found] ` <20090529170616.GA12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-29 17:06 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, David C. Hansen Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote: | > | I should have mentioned earlier, but there is also the case of | > | CLONE_NEWPID. If CLONE_NEWPID is given, then @nesting should be | > | plus one, _and_ the corresponding pid must be 1 or 0. | > | > Right. do_fork_with_pids() checks if CLONE_NEWPID is specified with | > target_pids and returns -EINVAL for now. | | Any reason not to handle this case already ? The only reason was that we are not planning on supporting C/R of nested containers for a while, but yes, we don't need to restrict clone_with_pids(). | | I have a simpler suggestion than above: pass the clone_flags to | copy_target_pids(), and in there, if CLONE_NEWPID is set, then | you should allocate an array +1 in size, and force last slot to | be 0 (or 1). User doesn't have to pass a larger array. Looks like it would be cleaner code-wise, to always allocate an extra element in the target_pids list and leave the last one set to 0. If CLONE_NEWPID is set, alloc_pid() will assign the first pid in the new namespace. If it is not set, the last element will never be referenced in alloc_pid(). ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <20090529170616.GA12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090529170616.GA12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-29 19:34 ` Sukadev Bhattiprolu [not found] ` <20090529193416.GB12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-29 19:34 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, David C. Hansen Sukadev Bhattiprolu [sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org] wrote: | Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote: I am not sure what the semantics should be for this case: - checkpoint a process that is in level-3 pid namespace - restart in a level-2 or level-1 pid namespace clone_with_pids() will fail now since number of pids specified would be 4 but kernel expects only 2 or 3. mktree/restart program cannot figure out current nesting to trim the target-pids. Should we remove the check of user-pids exceeding the current nesting level and simply ignore the pids from the older namespaces ? Sukadev ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <20090529193416.GB12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090529193416.GB12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-29 20:01 ` Oren Laadan [not found] ` <4A203F2E.1060807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 0 siblings, 1 reply; 29+ messages in thread From: Oren Laadan @ 2009-05-29 20:01 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen Sukadev Bhattiprolu wrote: > Sukadev Bhattiprolu [sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org] wrote: > | Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote: > > I am not sure what the semantics should be for this case: > > - checkpoint a process that is in level-3 pid namespace > - restart in a level-2 or level-1 pid namespace > Meaning: a container root was at level-3, so tasks in the container were level-3 through level-(3+N), where N is the in-container depth so to speak. Then it was restarted such that the base became level-2 or level-1. I think we already covered this. > clone_with_pids() will fail now since number of pids specified would > be 4 but kernel expects only 2 or 3. mktree/restart program cannot > figure out current nesting to trim the target-pids. > > Should we remove the check of user-pids exceeding the current nesting > level and simply ignore the pids from the older namespaces ? It seems to me that the current behavior is correct: I can't think of a case where trimming (silently) would make sense, or where a program would end up giving more pids that it's nesting level. Oren. ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <4A203F2E.1060807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <4A203F2E.1060807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2009-05-29 21:19 ` Sukadev Bhattiprolu [not found] ` <20090529211922.GC12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 29+ messages in thread From: Sukadev Bhattiprolu @ 2009-05-29 21:19 UTC (permalink / raw) To: Oren Laadan; +Cc: Containers, David C. Hansen Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote: | | | Sukadev Bhattiprolu wrote: | > Sukadev Bhattiprolu [sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org] wrote: | > | Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote: | > | > I am not sure what the semantics should be for this case: | > | > - checkpoint a process that is in level-3 pid namespace | > - restart in a level-2 or level-1 pid namespace | > | | Meaning: a container root was at level-3, so tasks in the container | were level-3 through level-(3+N), where N is the in-container depth | so to speak. Then it was restarted such that the base became level-2 | or level-1. | | I think we already covered this. | Sorry, I meant level3 not 'level minus 3' :-) Restating, suppose init_pid_ns is L0, and L3 is a pid namespace 3 levels deep (i.e the process has 4 pids at checkpoint time). When restarting, if the process only needs 2 pids bc it is L1, current behavior is to return -EINVAL. I have this check in copy_target_pids(): + if (num_pids < 0 || num_pids > nesting) + return ERR_PTR(-EINVAL); Is it ok to return -EINVAL when num_pids > nesting ? ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <20090529211922.GC12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 7/7][v2] Define clone_with_pids syscall [not found] ` <20090529211922.GC12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2009-05-29 21:32 ` Oren Laadan 0 siblings, 0 replies; 29+ messages in thread From: Oren Laadan @ 2009-05-29 21:32 UTC (permalink / raw) To: Sukadev Bhattiprolu; +Cc: Containers, David C. Hansen Sukadev Bhattiprolu wrote: > Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote: > | > | > | Sukadev Bhattiprolu wrote: > | > Sukadev Bhattiprolu [sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org] wrote: > | > | Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote: > | > > | > I am not sure what the semantics should be for this case: > | > > | > - checkpoint a process that is in level-3 pid namespace > | > - restart in a level-2 or level-1 pid namespace > | > > | > | Meaning: a container root was at level-3, so tasks in the container > | were level-3 through level-(3+N), where N is the in-container depth > | so to speak. Then it was restarted such that the base became level-2 > | or level-1. > | > | I think we already covered this. > | > > Sorry, I meant level3 not 'level minus 3' :-) Restating, suppose > init_pid_ns is L0, and L3 is a pid namespace 3 levels deep (i.e the > process has 4 pids at checkpoint time). When restarting, if the > process only needs 2 pids bc it is L1, current behavior is to return > -EINVAL. > > I have this check in copy_target_pids(): > > + if (num_pids < 0 || num_pids > nesting) > + return ERR_PTR(-EINVAL); > > Is it ok to return -EINVAL when num_pids > nesting ? > IMHO yes. I'd keep the semantics this way. clone_with_pid() is anyway intended for restart() only, and restart need never get to that situation. In fact, checkpoint() should only record those pids from the container "base" downwards, so the depth of the original "base" is irrelevant. Besdies, the alternative is to silently ignore part of what the caller requests - which is worse. Oren. ^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2009-05-29 21:32 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-05-28 4:37 [RFC][PATCH 1/7][v2] Factor out code to allocate pidmap page Sukadev Bhattiprolu
[not found] ` <20090528043748.GA16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 4:38 ` [RFC][PATCH 2/7][v2] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
2009-05-28 4:38 ` [RFC][PATCH 3/7][v2] Add target_pid parameter to alloc_pidmap() Sukadev Bhattiprolu
[not found] ` <20090528043834.GC16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 12:52 ` Serge E. Hallyn
2009-05-28 14:47 ` Oren Laadan
2009-05-28 4:38 ` [RFC][PATCH 4/7][v2] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
2009-05-28 4:39 ` [RFC][PATCH 5/7][v2] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
2009-05-28 4:39 ` [RFC][PATCH 6/7][v2] Define do_fork_with_pids() Sukadev Bhattiprolu
[not found] ` <20090528043929.GF16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 12:54 ` Serge E. Hallyn
2009-05-28 15:03 ` Oren Laadan
2009-05-28 4:39 ` [RFC][PATCH 7/7][v2] Define clone_with_pids syscall Sukadev Bhattiprolu
[not found] ` <20090528043945.GG16522-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 12:02 ` Matt Helsley
2009-05-28 15:01 ` Oren Laadan
[not found] ` <4A1EA73F.1080802-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-28 15:14 ` Serge E. Hallyn
[not found] ` <20090528151444.GA17772-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 17:01 ` Sukadev Bhattiprolu
[not found] ` <20090528170103.GA26183-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 17:47 ` Serge E. Hallyn
[not found] ` <20090528174708.GA2236-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org>
2009-05-28 18:00 ` Sukadev Bhattiprolu
[not found] ` <20090528180057.GA27191-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 23:45 ` Oren Laadan
2009-05-28 17:30 ` Sukadev Bhattiprolu
[not found] ` <20090528173019.GB26183-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-28 23:47 ` Oren Laadan
[not found] ` <4A1F228C.2020201-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-29 1:16 ` Sukadev Bhattiprolu
2009-05-29 3:05 ` Sukadev Bhattiprolu
[not found] ` <20090529030558.GA2548-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-29 5:29 ` Oren Laadan
[not found] ` <20090529054645.GA3344@us.ibm.com>
[not found] ` <20090529054645.GA3344-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-29 5:54 ` Oren Laadan
[not found] ` <4A1F78AF.6030404-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-29 17:06 ` Sukadev Bhattiprolu
[not found] ` <20090529170616.GA12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-29 19:34 ` Sukadev Bhattiprolu
[not found] ` <20090529193416.GB12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-29 20:01 ` Oren Laadan
[not found] ` <4A203F2E.1060807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-29 21:19 ` Sukadev Bhattiprolu
[not found] ` <20090529211922.GC12597-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-29 21:32 ` Oren Laadan
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.