* Re: [v12][PATCH 0/9] Implement eclone() syscall
[not found] <20091111043440.GA9377@suka>
@ 2009-11-11 22:38 ` Sukadev Bhattiprolu
[not found] ` <20091111044250.GA11393@suka>
` (8 subsequent siblings)
9 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan,
Pavel Emelyanov, hpa, Nathan Lynch, matthltc, arnd, roland,
mtk.manpages, linux-api, Containers, linux-kernel
Cc: LKML
Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
|
| Based on discussions on containers mailing list and IRC, we settled on
| the name eclone(). Please let me know of a better name or if there are
| other comments on the patchset.
|
| ---
|
| Subject: [v12][PATCH 0/9] Implement eclone() syscall
|
| To support application checkpoint/restart, a task must have the same pid it
| had when it was checkpointed. When containers are nested, the tasks within
| the containers exist in multiple pid namespaces and hence have multiple pids
| to specify during restart.
|
| This patchset implements a new system call, eclone() that lets a process
| specify the pids of the child process.
|
| Patches 1 through 6 are helper patches needed for choosing a pid for the
| child process.
|
| PATCH 8 implements the eclone() system call on x86. The interface defined in
| PATCH 8 has been ported to s390 and ppc64 architectures, but they will be
| posted as a separate patchset if this patchset is accepted.
|
| PATCH 9 adds some documentation on the new system call, some/all of which
| will eventually go into a man page.
|
| Changelog[v12]:
| - Ignore ->child_stack_size when ->child_stack_base is NULL (PATCH 8)
| - Cleanup/simplify example in Documentation/eclone (PATCH 9).
| - Rename sys call to a shorter name, eclone()
|
| Changelog[v11]:
| - [Dave Hansen] Move clone_args validation checks to arch-indpeendent
| code.
| - [Oren Laadan] Make args_size a parameter to system call and remove
| it from 'struct clone_args'
|
| Changelog[v10]:
| - [Linus Torvalds] Use PTREGSCALL() implementation for clone rather
| than the generic system call
| - Rename clone3() to clone_with_pids()
| - Update Documentation/clone_with_pids() to show example usage with
| the PTREGSCALL implementation.
|
| Changelog[v9]:
| - [Pavel Emelyanov] Drop the patch that made 'pid_max' a property
| of struct pid_namespace
| - [Roland McGrath, H. Peter Anvin and earlier on, Serge Hallyn] To
| avoid inadvertent truncation clone_flags, preserve the first
| parameter of clone3() as 'u32 clone_flags' and specify newer
| flags in clone_args.flags_high (PATCH 8/9 and PATCH 9/9)
| - [Eric Biederman] Generalize alloc_pidmap() code to simplify and
| remove duplication (see PATCH 3/9].
|
| Changelog[v8]:
| - [Oren Laadan, Louis Rilling, KOSAKI Motohiro]
| The name 'clone2()' is in use - renamed new syscall to clone3().
| - [Oren Laadan] ->parent_tidptr and ->child_tidptr need to be 64bit.
| - [Oren Laadan] Ensure that unused fields/flags in clone_struct are 0.
| (Added [PATCH 7/10] to the patchset).
|
| Changelog[v7]:
| - [Peter Zijlstra, Arnd Bergmann]
| Group the arguments to clone2() into a 'struct clone_arg' to
| workaround the issue of exceeding 6 arguments to the system call.
| Also define clone-flags as u64 to allow additional clone-flags.
|
| Changelog[v6]:
| - [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds]
| Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is
| constant across architectures (Patches 7, 8).
| - (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
| 'unum_pids < 0' check (Patches 7,8)
| - (Pavel Machek) New patch (Patch 9) to add some documentation.
|
| Changelog[v5]:
| - Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch
| into this set)
| - (Eric Biederman): Avoid the new function, set_pidmap() - added
| couple of checks on 'target_pid' in alloc_pidmap() itself.
|
| === IMPORTANT NOTE:
|
| clone() system call has another limitation - all but one bits in clone-flags
| are in use and if more new clone-flags are needed, we will need a variant of
| the clone() system call.
|
| It appears to make sense to try and extend this new system call to address
| this limitation as well. The requirements of a new clone system call could
| then be summarized as:
|
| - do everything clone() does today, and
| - give application an ability to choose pids for the child process
| in all ancestor pid namespaces, and
| - allow more clone_flags
|
| Contstraints:
|
| - system-calls are restricted to 6 parameters and clone() already
| takes 5 parameters, any extension to clone() interface would require
| one or more copy_from_user(). (Not sure if copy_from_user() of ~40
| bytes would have a significant impact on performance of clone()).
|
| Based on these requirements and constraints, we explored a couple of system
| call interfaces (in earlier versions of this patchset). Based on input from
| Arnd Bergmann and others, the new interface of the system call is:
|
| struct clone_args {
| u64 clone_flags_high;
| u64 child_stack_base;
| u64 child_stack_size;
| u64 parent_tid_ptr;
| u64 child_tid_ptr;
| u32 nr_pids;
| u32 reserved0;
| u64 reserved1;
| };
|
| sys_eclone(u32 flags_low, struct clone_args *cargs, int args_size,
| pid_t *pids)
|
| Details of the struct clone_args and the usage are explained in the
| documentation (PATCH 9/9).
|
| NOTE:
| While this patchset enables support for more clone-flags, actual
| implementation for additional clone-flags is best implemented as
| a separate patchset (PATCH 8/9 identifies some TODOs)
|
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [v12][PATCH 1/9] Factor out code to allocate pidmap page
[not found] ` <20091111044250.GA11393@suka>
@ 2009-11-11 22:38 ` Sukadev Bhattiprolu
0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:38 UTC (permalink / raw)
To: Andrew Morton
Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
Pavel Emelyanov, linux-kernel
Cc LKML
Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
|
| Subject: [v12][PATCH 1/9] Factor out code to allocate pidmap page
|
| To simplify alloc_pidmap(), move code to allocate a pid map page to a
| separate function.
|
| Changelog[v3]:
| - Earlier version of patchset called alloc_pidmap_page() from two
| places. But now its called from only one place. Even so, moving
| this code out into a separate function simplifies alloc_pidmap().
| Changelog[v2]:
| - (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
| -ENOMEM on error instead of -1.
|
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Serge Hallyn <serue@us.ibm.com>
| Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
| kernel/pid.c | 45 ++++++++++++++++++++++++++++++---------------
| 1 files changed, 30 insertions(+), 15 deletions(-)
|
| diff --git a/kernel/pid.c b/kernel/pid.c
| index d3f722d..7d4bb6e 100644
| --- a/kernel/pid.c
| +++ b/kernel/pid.c
| @@ -122,9 +122,35 @@ static void free_pidmap(struct upid *upid)
| atomic_inc(&map->nr_free);
| }
|
| +static int alloc_pidmap_page(struct pidmap *map)
| +{
| + void *page;
| +
| + if (likely(map->page))
| + return 0;
| +
| + page = kzalloc(PAGE_SIZE, GFP_KERNEL);
| +
| + /*
| + * Free the page if someone raced with us installing it:
| + */
| + spin_lock_irq(&pidmap_lock);
| + if (map->page)
| + kfree(page);
| + else
| + map->page = page;
| + spin_unlock_irq(&pidmap_lock);
| +
| + if (unlikely(!map->page))
| + return -ENOMEM;
| +
| + return 0;
| +}
| +
| static int alloc_pidmap(struct pid_namespace *pid_ns)
| {
| int i, offset, max_scan, pid, last = pid_ns->last_pid;
| + int rc;
| struct pidmap *map;
|
| pid = last + 1;
| @@ -134,21 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
| map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
| max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
| for (i = 0; i <= max_scan; ++i) {
| - if (unlikely(!map->page)) {
| - void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
| - /*
| - * Free the page if someone raced with us
| - * installing it:
| - */
| - spin_lock_irq(&pidmap_lock);
| - if (map->page)
| - kfree(page);
| - else
| - map->page = page;
| - spin_unlock_irq(&pidmap_lock);
| - if (unlikely(!map->page))
| - break;
| - }
| + rc = alloc_pidmap_page(map);
| + if (rc)
| + break;
| +
| if (likely(atomic_read(&map->nr_free))) {
| do {
| if (!test_and_set_bit(offset, map->page)) {
| --
| 1.6.0.4
|
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [v12][PATCH 2/9] Have alloc_pidmap() return actual error code
[not found] ` <20091111044313.GB11393@suka>
@ 2009-11-11 22:39 ` Sukadev Bhattiprolu
0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:39 UTC (permalink / raw)
To: Andrew Morton
Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
Pavel Emelyanov, linux-kernel
CC: LKML
Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
|
| Subject: [v12][PATCH 2/9] Have alloc_pidmap() return actual error code
|
| alloc_pidmap() can fail either because all pid numbers are in use or
| because memory allocation failed. With support for setting a specific
| pid number, alloc_pidmap() would also fail if either the given pid
| number is invalid or in use.
|
| Rather than have callers assume -ENOMEM, have alloc_pidmap() return
| the actual error.
|
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Serge Hallyn <serue@us.ibm.com>
| Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
| kernel/fork.c | 5 +++--
| kernel/pid.c | 14 +++++++++-----
| 2 files changed, 12 insertions(+), 7 deletions(-)
|
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 4c20fff..93626b2 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -1156,10 +1156,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
| goto bad_fork_cleanup_io;
|
| if (pid != &init_struct_pid) {
| - retval = -ENOMEM;
| pid = alloc_pid(p->nsproxy->pid_ns);
| - if (!pid)
| + if (IS_ERR(pid)) {
| + retval = PTR_ERR(pid);
| goto bad_fork_cleanup_io;
| + }
|
| if (clone_flags & CLONE_NEWPID) {
| retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
| diff --git a/kernel/pid.c b/kernel/pid.c
| index 7d4bb6e..c4d9914 100644
| --- a/kernel/pid.c
| +++ b/kernel/pid.c
| @@ -150,7 +150,7 @@ static int alloc_pidmap_page(struct pidmap *map)
| static int alloc_pidmap(struct pid_namespace *pid_ns)
| {
| int i, offset, max_scan, pid, last = pid_ns->last_pid;
| - int rc;
| + int rc = -EAGAIN;
| struct pidmap *map;
|
| pid = last + 1;
| @@ -189,12 +189,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
| } else {
| map = &pid_ns->pidmap[0];
| offset = RESERVED_PIDS;
| - if (unlikely(last == offset))
| + if (unlikely(last == offset)) {
| + rc = -EAGAIN;
| break;
| + }
| }
| pid = mk_pid(pid_ns, map, offset);
| }
| - return -1;
| + return rc;
| }
|
| int next_pidmap(struct pid_namespace *pid_ns, int last)
| @@ -263,8 +265,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
| struct upid *upid;
|
| pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
| - if (!pid)
| + if (!pid) {
| + pid = ERR_PTR(-ENOMEM);
| goto out;
| + }
|
| tmp = ns;
| for (i = ns->level; i >= 0; i--) {
| @@ -299,7 +303,7 @@ out_free:
| free_pidmap(pid->numbers + i);
|
| kmem_cache_free(ns->pid_cachep, pid);
| - pid = NULL;
| + pid = ERR_PTR(nr);
| goto out;
| }
|
| --
| 1.6.0.4
|
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [v12][PATCH 3/9] Define set_pidmap() function
[not found] ` <20091111044329.GC11393@suka>
@ 2009-11-11 22:39 ` Sukadev Bhattiprolu
0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:39 UTC (permalink / raw)
To: Andrew Morton
Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
Pavel Emelyanov, linux-kernel
Cc: LKML
Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
|
| Subject: [v12][PATCH 3/9] Define set_pidmap() function
|
| Define a set_pidmap() interface which is like alloc_pidmap() only that
| caller specifies the pid number to be assigned.
|
| Changelog[v9]:
| - Completely rewrote this patch based on Eric Biederman's code.
| Changelog[v7]:
| - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
| Changelog[v6]:
| - Separate target_pid > 0 case to minimize the number of checks needed.
| Changelog[v3]:
| - (Eric Biederman): Avoid set_pidmap() function. Added couple of
| checks for target_pid in alloc_pidmap() itself.
| Changelog[v2]:
| - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code
| actually checks for 'pid <= 0' for completeness).
|
| Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
| Acked-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
| kernel/pid.c | 41 +++++++++++++++++++++++++++++++++--------
| 1 files changed, 33 insertions(+), 8 deletions(-)
|
| diff --git a/kernel/pid.c b/kernel/pid.c
| index c4d9914..afef766 100644
| --- a/kernel/pid.c
| +++ b/kernel/pid.c
| @@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map)
| return 0;
| }
|
| -static int alloc_pidmap(struct pid_namespace *pid_ns)
| +static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
| + int max)
| {
| - int i, offset, max_scan, pid, last = pid_ns->last_pid;
| + int i, offset, max_scan, pid;
| int rc = -EAGAIN;
| struct pidmap *map;
|
| pid = last + 1;
| if (pid >= pid_max)
| - pid = RESERVED_PIDS;
| + pid = min;
| offset = pid & BITS_PER_PAGE_MASK;
| map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
| - max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
| + max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
| for (i = 0; i <= max_scan; ++i) {
| rc = alloc_pidmap_page(map);
| if (rc)
| @@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
| do {
| if (!test_and_set_bit(offset, map->page)) {
| atomic_dec(&map->nr_free);
| - pid_ns->last_pid = pid;
| return pid;
| }
| offset = find_next_offset(map, offset);
| @@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
| * bitmap block and the final block was the same
| * as the starting point, pid is before last_pid.
| */
| - } while (offset < BITS_PER_PAGE && pid < pid_max &&
| + } while (offset < BITS_PER_PAGE && pid < max &&
| (i != max_scan || pid < last ||
| !((last+1) & BITS_PER_PAGE_MASK)));
| }
| - if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
| + if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
| ++map;
| offset = 0;
| } else {
| map = &pid_ns->pidmap[0];
| - offset = RESERVED_PIDS;
| + offset = min;
| if (unlikely(last == offset)) {
| rc = -EAGAIN;
| break;
| @@ -199,6 +199,31 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
| return rc;
| }
|
| +static int alloc_pidmap(struct pid_namespace *pid_ns)
| +{
| + int nr;
| +
| + nr = do_alloc_pidmap(pid_ns, pid_ns->last_pid, RESERVED_PIDS, pid_max);
| + if (nr >= 0)
| + pid_ns->last_pid = nr;
| + return nr;
| +}
| +
| +static int set_pidmap(struct pid_namespace *pid_ns, int target)
| +{
| + if (!target)
| + return alloc_pidmap(pid_ns);
| +
| + if (target >= pid_max)
| + return -EINVAL;
| +
| + if ((target < 0) || (target < RESERVED_PIDS &&
| + pid_ns->last_pid >= RESERVED_PIDS))
| + return -EINVAL;
| +
| + return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
| +}
| +
| int next_pidmap(struct pid_namespace *pid_ns, int last)
| {
| int offset;
| --
| 1.6.0.4
|
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [v12][PATCH 4/9] Add target_pids parameter to alloc_pid()
[not found] ` <20091111044347.GD11393@suka>
@ 2009-11-11 22:39 ` Sukadev Bhattiprolu
0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:39 UTC (permalink / raw)
To: Andrew Morton
Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
Pavel Emelyanov, linux-kernel
Cc: LKML
Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
|
| Subject: [v12][PATCH 4/9] Add target_pids parameter to alloc_pid()
|
| This parameter is currently NULL, but will be used in a follow-on patch.
|
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Serge Hallyn <serue@us.ibm.com>
| Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
| include/linux/pid.h | 2 +-
| kernel/fork.c | 3 ++-
| kernel/pid.c | 9 +++++++--
| 3 files changed, 10 insertions(+), 4 deletions(-)
|
| diff --git a/include/linux/pid.h b/include/linux/pid.h
| index 49f1c2f..914185d 100644
| --- a/include/linux/pid.h
| +++ b/include/linux/pid.h
| @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
| extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
| int next_pidmap(struct pid_namespace *pid_ns, int last);
|
| -extern struct pid *alloc_pid(struct pid_namespace *ns);
| +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
| extern void free_pid(struct pid *pid);
|
| /*
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 93626b2..3f1dddf 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -980,6 +980,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
| int retval;
| struct task_struct *p;
| int cgroup_callbacks_done = 0;
| + pid_t *target_pids = NULL;
|
| if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
| return ERR_PTR(-EINVAL);
| @@ -1156,7 +1157,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
| goto bad_fork_cleanup_io;
|
| if (pid != &init_struct_pid) {
| - pid = alloc_pid(p->nsproxy->pid_ns);
| + pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
| if (IS_ERR(pid)) {
| retval = PTR_ERR(pid);
| goto bad_fork_cleanup_io;
| diff --git a/kernel/pid.c b/kernel/pid.c
| index afef766..3ee52cd 100644
| --- a/kernel/pid.c
| +++ b/kernel/pid.c
| @@ -281,13 +281,14 @@ void free_pid(struct pid *pid)
| call_rcu(&pid->rcu, delayed_put_pid);
| }
|
| -struct pid *alloc_pid(struct pid_namespace *ns)
| +struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
| {
| struct pid *pid;
| enum pid_type type;
| int i, nr;
| struct pid_namespace *tmp;
| struct upid *upid;
| + pid_t tpid;
|
| pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
| if (!pid) {
| @@ -297,7 +298,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
|
| tmp = ns;
| for (i = ns->level; i >= 0; i--) {
| - nr = alloc_pidmap(tmp);
| + tpid = 0;
| + if (target_pids)
| + tpid = target_pids[i];
| +
| + nr = set_pidmap(tmp, tpid);
| if (nr < 0)
| goto out_free;
|
| --
| 1.6.0.4
|
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [v12][PATCH 5/9] Add target_pids parameter to copy_process()
[not found] ` <20091111044403.GE11393@suka>
@ 2009-11-11 22:40 ` Sukadev Bhattiprolu
0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:40 UTC (permalink / raw)
To: Andrew Morton
Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
Pavel Emelyanov, linux-kernel
CC: LKML
Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
|
| Subject: [v12][PATCH 5/9] Add target_pids parameter to copy_process()
|
| Add a 'target_pids' parameter to copy_process(). The new parameter will be
| used in a follow-on patch when eclone() is implemented.
|
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Serge Hallyn <serue@us.ibm.com>
| Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
| kernel/fork.c | 7 ++++---
| 1 files changed, 4 insertions(+), 3 deletions(-)
|
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 3f1dddf..c8a06de 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -975,12 +975,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
| unsigned long stack_size,
| int __user *child_tidptr,
| struct pid *pid,
| + pid_t *target_pids,
| int trace)
| {
| int retval;
| struct task_struct *p;
| int cgroup_callbacks_done = 0;
| - pid_t *target_pids = NULL;
|
| if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
| return ERR_PTR(-EINVAL);
| @@ -1361,7 +1361,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
| struct pt_regs regs;
|
| task = copy_process(CLONE_VM, 0, idle_regs(®s), 0, NULL,
| - &init_struct_pid, 0);
| + &init_struct_pid, NULL, 0);
| if (!IS_ERR(task))
| init_idle(task, cpu);
|
| @@ -1384,6 +1384,7 @@ long do_fork(unsigned long clone_flags,
| struct task_struct *p;
| int trace = 0;
| long nr;
| + pid_t *target_pids = NULL;
|
| /*
| * Do some preliminary argument and permissions checking before we
| @@ -1424,7 +1425,7 @@ long do_fork(unsigned long clone_flags,
| trace = tracehook_prepare_clone(clone_flags);
|
| p = copy_process(clone_flags, stack_start, regs, stack_size,
| - child_tidptr, NULL, trace);
| + child_tidptr, NULL, target_pids, trace);
| /*
| * Do this prior waking up the new thread - the thread pointer
| * might get invalid after that point, if the thread exits quickly.
| --
| 1.6.0.4
|
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [v12][PATCH 6/9] Check invalid clone flags
[not found] ` <20091111044422.GF11393@suka>
@ 2009-11-11 22:40 ` Sukadev Bhattiprolu
0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:40 UTC (permalink / raw)
To: Andrew Morton
Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
Pavel Emelyanov, linux-kernel
Cc: LKML
Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
|
| Subject: [v12][PATCH 6/9] Check invalid clone flags
|
| As pointed out by Oren Laadan, we want to ensure that unused bits in the
| clone-flags remain unused and available for future. To ensure this, define
| a mask of clone-flags and check the flags in the clone() system calls.
|
| Changelog[v9]:
| - Include the unused clone-flag (CLONE_UNUSED) to VALID_CLONE_FLAGS
| to avoid breaking any applications that may have set it. IOW, this
| patch/check only applies to clone-flags bits 33 and higher.
|
| Changelog[v8]:
| - New patch in set
|
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
| include/linux/sched.h | 12 ++++++++++++
| kernel/fork.c | 3 +++
| 2 files changed, 15 insertions(+), 0 deletions(-)
|
| diff --git a/include/linux/sched.h b/include/linux/sched.h
| index 75e6e60..a4d2c23 100644
| --- a/include/linux/sched.h
| +++ b/include/linux/sched.h
| @@ -29,6 +29,18 @@
| #define CLONE_NEWNET 0x40000000 /* New network namespace */
| #define CLONE_IO 0x80000000 /* Clone io context */
|
| +#define CLONE_UNUSED 0x00001000 /* Can be reused ? */
| +
| +#define VALID_CLONE_FLAGS (CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
| + CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\
| + CLONE_VFORK | CLONE_PARENT | CLONE_THREAD |\
| + CLONE_NEWNS | CLONE_SYSVSEM | CLONE_SETTLS |\
| + CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |\
| + CLONE_DETACHED | CLONE_UNTRACED |\
| + CLONE_CHILD_SETTID | CLONE_STOPPED |\
| + CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
| + CLONE_NEWPID | CLONE_NEWNET | CLONE_IO)
| +
| /*
| * Scheduling policies
| */
| diff --git a/kernel/fork.c b/kernel/fork.c
| index c8a06de..11f77ed 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -982,6 +982,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
| struct task_struct *p;
| int cgroup_callbacks_done = 0;
|
| + if (clone_flags & ~VALID_CLONE_FLAGS)
| + return ERR_PTR(-EINVAL);
| +
| if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
| return ERR_PTR(-EINVAL);
|
| --
| 1.6.0.4
|
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [v12][PATCH 7/9] Define do_fork_with_pids()
[not found] ` <20091111044438.GG11393@suka>
@ 2009-11-11 22:40 ` Sukadev Bhattiprolu
0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:40 UTC (permalink / raw)
To: Andrew Morton
Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
Pavel Emelyanov, linux-kernel
cc: lkml
Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
|
| Subject: [v12][PATCH 7/9] Define do_fork_with_pids()
|
| do_fork_with_pids() is same as do_fork(), except that it takes an
| additional, 'pid_set', parameter. This parameter, currently unused,
| specifies the set of target pids of the process in each of its pid
| namespaces.
|
| Changelog[v7]:
| - Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
| instead of 'struct pid_set *'.
|
| Changelog[v6]:
| - (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
| Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
| is constant across architectures.
| - (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.
|
| Changelog[v4]:
| - Rename 'struct target_pid_set' to 'struct pid_set' since it may
| be useful in other contexts.
|
| Changelog[v3]:
| - Fix "long-line" warning from checkpatch.pl
|
| Changelog[v2]:
| - To facilitate moving architecture-inpdendent code to kernel/fork.c
| pass in 'struct target_pid_set __user *' to do_fork_with_pids()
| rather than 'pid_t *' (next patch moves the arch-independent
| code to kernel/fork.c)
|
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Serge Hallyn <serue@us.ibm.com>
| Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
| include/linux/sched.h | 3 +++
| kernel/fork.c | 17 +++++++++++++++--
| 2 files changed, 18 insertions(+), 2 deletions(-)
|
| diff --git a/include/linux/sched.h b/include/linux/sched.h
| index a4d2c23..85e971a 100644
| --- a/include/linux/sched.h
| +++ b/include/linux/sched.h
| @@ -2153,6 +2153,9 @@ extern int disallow_signal(int);
|
| extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
| extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
| +extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
| + unsigned long, int __user *, int __user *,
| + unsigned int, pid_t __user *);
| struct task_struct *fork_idle(int);
|
| extern void set_task_comm(struct task_struct *tsk, char *from);
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 11f77ed..210e841 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -1377,12 +1377,14 @@ struct task_struct * __cpuinit fork_idle(int cpu)
| * It copies the process, and if successful kick-starts
| * it and waits for it to finish using the VM if required.
| */
| -long do_fork(unsigned long clone_flags,
| +long do_fork_with_pids(unsigned long clone_flags,
| unsigned long stack_start,
| struct pt_regs *regs,
| unsigned long stack_size,
| int __user *parent_tidptr,
| - int __user *child_tidptr)
| + int __user *child_tidptr,
| + unsigned int num_pids,
| + pid_t __user *upids)
| {
| struct task_struct *p;
| int trace = 0;
| @@ -1485,6 +1487,17 @@ long do_fork(unsigned long clone_flags,
| return nr;
| }
|
| +long do_fork(unsigned long clone_flags,
| + unsigned long stack_start,
| + struct pt_regs *regs,
| + unsigned long stack_size,
| + int __user *parent_tidptr,
| + int __user *child_tidptr)
| +{
| + return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
| + parent_tidptr, child_tidptr, 0, NULL);
| +}
| +
| #ifndef ARCH_MIN_MMSTRUCT_ALIGN
| #define ARCH_MIN_MMSTRUCT_ALIGN 0
| #endif
| --
| 1.6.0.4
|
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [v12][PATCH 8/9] Define eclone() syscall
[not found] ` <20091111044509.GH11393@suka>
@ 2009-11-11 22:40 ` Sukadev Bhattiprolu
0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:40 UTC (permalink / raw)
To: Andrew Morton
Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
Pavel Emelyanov, linux-kernel
cc: lkml
Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
|
| Subject: [v12][PATCH 8/9] Define eclone() syscall
|
| Container restart requires that a task have the same pid it had when it was
| checkpointed. When containers are nested the tasks within the containers
| exist in multiple pid namespaces and hence have multiple pids to specify
| during restart.
|
| eclone(), intended for use during restart, is the same as
| clone(), except that it takes a 'pids' paramter. This parameter lets
| caller choose specific pid numbers for the child process, in the
| process's active and ancestor pid namespaces. (Descendant pid namespaces
| in general don't matter since processes don't have pids in them anyway,
| but see comments in copy_target_pids() regarding CLONE_NEWPID).
|
| eclone() also attempts to address a second limitation of the
| clone() system call. clone() is restricted to 32 clone flags and all but
| one of these are in use. If more new clone flags are needed, we will be
| forced to define a new variant of the clone() system call. To address
| this, eclone() allows at least 64 clone flags with some room
| for more if necessary.
|
| To prevent unprivileged processes from misusing this interface,
| eclone() currently needs CAP_SYS_ADMIN, when the 'pids' parameter
| is non-NULL.
|
| See Documentation/eclone in next patch for more details and an
| example of its usage.
|
| NOTE:
| - System calls are restricted to 6 parameters and the number and sizes
| of parameters needed for eclone() exceed 6 integers. The new
| prototype works around this restriction while providing some
| flexibility if eclone() needs to be further extended in the
| future.
| TODO:
| - We should convert clone-flags to 64-bit value in all architectures.
| Its probably best to do that as a separate patchset since clone_flags
| touches several functions and that patchset seems independent of this
| new system call.
|
| Changelog[v12]:
| - [Serge Hallyn] Ignore ->child_stack_size if ->child_stack_base
| is NULL.
| - [Oren Laadan, Serge Hallyn] Rename clone_with_pids() to eclone()
| Changelog[v11]:
| - [Dave Hansen] Move clone_args validation checks to arch-indpeendent
| code.
| - [Oren Laadan] Make args_size a parameter to system call and remove
| it from 'struct clone_args'
|
| Changelog[v10]:
| - Rename clone3() to clone_with_pids()
| - [Linus Torvalds] Use PTREGSCALL() rather than the generic syscall
| implementation
|
| Changelog[v9]:
| - [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
| architectures split the new clone-flags into 'low' and 'high'
| words and pass in the 'lower' flags as the first argument.
| This would maintain similarity of the clone3() with clone()/
| clone2(). Also has the side-effect of the name matching the
| number of parameters :-)
| - [Roland McGrath] Rename structure to 'clone_args' and add a
| 'child_stack_size' field
|
| Changelog[v8]
| - [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
| must be 64-bit.
| - clone2() is in use in IA64. Rename system call to clone3().
|
| Changelog[v7]:
| - [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
| and group parameters into a new 'struct clone_struct' object.
|
| Changelog[v6]:
| - (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
| Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
| is constant across architectures.
| - (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
| 'unum_pids < 0' check.
|
| Changelog[v4]:
| - (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'
|
| Changelog[v3]:
| - (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
| in the target_pids[] list and setting it 0. See copy_target_pids()).
| - (Oren Laadan) Specified target pids should apply only to youngest
| pid-namespaces (see copy_target_pids())
| - (Matt Helsley) Update patch description.
|
| Changelog[v2]:
| - Remove unnecessary printk and add a note to callers of
| copy_target_pids() to free target_pids.
| - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
| - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
| 'num_pids == 0' (fall back to normal clone()).
| - Move arch-independent code (sanity checks and copy-in of target-pids)
| into kernel/fork.c and simplify sys_clone_with_pids()
|
| Changelog[v1]:
| - Fixed some compile errors (had fixed these errors earlier in my
| git tree but had not refreshed patches before emailing them)
|
| Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Acked-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
| arch/x86/include/asm/syscalls.h | 1 +
| arch/x86/include/asm/unistd_32.h | 1 +
| arch/x86/kernel/entry_32.S | 1 +
| arch/x86/kernel/process_32.c | 47 ++++++++++++++
| arch/x86/kernel/syscall_table_32.S | 1 +
| include/linux/sched.h | 2 +
| include/linux/types.h | 11 +++
| kernel/fork.c | 124 +++++++++++++++++++++++++++++++++++-
| 8 files changed, 187 insertions(+), 1 deletions(-)
|
| diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
| index 372b76e..2cadb8e 100644
| --- a/arch/x86/include/asm/syscalls.h
| +++ b/arch/x86/include/asm/syscalls.h
| @@ -40,6 +40,7 @@ long sys_iopl(struct pt_regs *);
|
| /* kernel/process_32.c */
| int sys_clone(struct pt_regs *);
| +int sys_eclone(struct pt_regs *);
| int sys_execve(struct pt_regs *);
|
| /* kernel/signal.c */
| diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
| index 6fb3c20..030b121 100644
| --- a/arch/x86/include/asm/unistd_32.h
| +++ b/arch/x86/include/asm/unistd_32.h
| @@ -342,6 +342,7 @@
| #define __NR_pwritev 334
| #define __NR_rt_tgsigqueueinfo 335
| #define __NR_perf_event_open 336
| +#define __NR_eclone 337
|
| #ifdef __KERNEL__
|
| diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
| index c097e7d..7e7f3c8 100644
| --- a/arch/x86/kernel/entry_32.S
| +++ b/arch/x86/kernel/entry_32.S
| @@ -718,6 +718,7 @@ ptregs_##name: \
| PTREGSCALL(iopl)
| PTREGSCALL(fork)
| PTREGSCALL(clone)
| +PTREGSCALL(eclone)
| PTREGSCALL(vfork)
| PTREGSCALL(execve)
| PTREGSCALL(sigaltstack)
| diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
| index 4cf7956..be7c1a1 100644
| --- a/arch/x86/kernel/process_32.c
| +++ b/arch/x86/kernel/process_32.c
| @@ -445,6 +445,53 @@ int sys_clone(struct pt_regs *regs)
| return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr);
| }
|
| +int sys_eclone(struct pt_regs *regs)
| +{
| + int rc;
| + int args_size;
| + struct clone_args kca;
| + unsigned long flags;
| + int __user *parent_tid_ptr;
| + int __user *child_tid_ptr;
| + unsigned long __user child_stack;
| + unsigned long stack_size;
| + unsigned int flags_low;
| + struct clone_args __user *uca;
| + pid_t __user *pids;
| +
| + flags_low = regs->bx;
| + uca = (struct clone_args __user *)regs->cx;
| + args_size = regs->dx;
| + pids = (pid_t __user *)regs->di;
| +
| + rc = fetch_clone_args_from_user(uca, args_size, &kca);
| + if (rc)
| + return rc;
| +
| + /*
| + * TODO: Convert 'clone-flags' to 64-bits on all architectures.
| + * TODO: When ->clone_flags_high is non-zero, copy it in to the
| + * higher word(s) of 'flags':
| + *
| + * flags = (kca.clone_flags_high << 32) | flags_low;
| + */
| + flags = flags_low;
| + parent_tid_ptr = (int *)kca.parent_tid_ptr;
| + child_tid_ptr = (int *)kca.child_tid_ptr;
| + stack_size = (unsigned long)kca.child_stack_size;
| +
| + child_stack = 0UL;
| + if (kca.child_stack_base)
| + child_stack = (unsigned long)kca.child_stack_base + stack_size;
| +
| + if (!child_stack)
| + child_stack = regs->sp;
| +
| + return do_fork_with_pids(flags, child_stack, regs, stack_size,
| + parent_tid_ptr, child_tid_ptr, kca.nr_pids,
| + pids);
| +}
| +
| /*
| * sys_execve() executes a new program.
| */
| diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
| index 0157cd2..2b6394c 100644
| --- a/arch/x86/kernel/syscall_table_32.S
| +++ b/arch/x86/kernel/syscall_table_32.S
| @@ -336,3 +336,4 @@ ENTRY(sys_call_table)
| .long sys_pwritev
| .long sys_rt_tgsigqueueinfo /* 335 */
| .long sys_perf_event_open
| + .long ptregs_eclone
| diff --git a/include/linux/sched.h b/include/linux/sched.h
| index 85e971a..b2a881d 100644
| --- a/include/linux/sched.h
| +++ b/include/linux/sched.h
| @@ -2153,6 +2153,8 @@ extern int disallow_signal(int);
|
| extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
| extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
| +extern int fetch_clone_args_from_user(struct clone_args __user *, int,
| + struct clone_args *);
| extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
| unsigned long, int __user *, int __user *,
| unsigned int, pid_t __user *);
| diff --git a/include/linux/types.h b/include/linux/types.h
| index c42724f..0d08683 100644
| --- a/include/linux/types.h
| +++ b/include/linux/types.h
| @@ -204,6 +204,17 @@ struct ustat {
| char f_fpack[6];
| };
|
| +struct clone_args {
| + u64 clone_flags_high;
| + u64 child_stack_base;
| + u64 child_stack_size;
| + u64 parent_tid_ptr;
| + u64 child_tid_ptr;
| + u32 nr_pids;
| + u32 reserved0;
| + u64 reserved1;
| +};
| +
| #endif /* __KERNEL__ */
| #endif /* __ASSEMBLY__ */
| #endif /* _LINUX_TYPES_H */
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 210e841..cacc26b 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -1372,6 +1372,114 @@ struct task_struct * __cpuinit fork_idle(int cpu)
| }
|
| /*
| + * If user specified any 'target-pids' in @upid_setp, copy them from
| + * user and return a pointer to a local copy of the list of pids. The
| + * caller must free the list, when they are done using it.
| + *
| + * If user did not specify any target pids, return NULL (caller should
| + * treat this like normal clone).
| + *
| + * On any errors, return the error code
| + */
| +static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
| +{
| + int j;
| + int rc;
| + int size;
| + int knum_pids; /* # of pids needed in kernel */
| + pid_t *target_pids;
| +
| + if (!unum_pids)
| + return NULL;
| +
| + knum_pids = task_pid(current)->level + 1;
| + if (unum_pids > knum_pids)
| + return ERR_PTR(-EINVAL);
| +
| + /*
| + * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
| + * and set it to 0. This last entry in target_pids[] corresponds to the
| + * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
| + * specified. If CLONE_NEWPID was not specified, this last entry will
| + * simply be ignored.
| + */
| + target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
| + if (!target_pids)
| + return ERR_PTR(-ENOMEM);
| +
| + /*
| + * A process running in a level 2 pid namespace has three pid namespaces
| + * and hence three pid numbers. If this process is checkpointed,
| + * information about these three namespaces are saved. We refer to these
| + * namespaces as 'known namespaces'.
| + *
| + * If this checkpointed process is however restarted in a level 3 pid
| + * namespace, the restarted process has an extra ancestor pid namespace
| + * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
| + *
| + * During restart, the process requests specific pids for its 'known
| + * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
| + *
| + * Since the requested-pids correspond to 'known namespaces' and since
| + * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
| + * namespaces', copy requested pids to the back-end of target_pids[]
| + * (i.e before the last entry for CLONE_NEWPID mentioned above).
| + * Any entries in target_pids[] not corresponding to a requested pid
| + * will be set to zero and kernel assigns a pid in those namespaces.
| + *
| + * NOTE: The order of pids in target_pids[] is oldest pid namespace to
| + * youngest (target_pids[0] corresponds to init_pid_ns). i.e.
| + * the order is:
| + *
| + * - pids for 'unknown-namespaces' (if any)
| + * - pids for 'known-namespaces' (requested pids)
| + * - 0 in the last entry (for CLONE_NEWPID).
| + */
| + j = knum_pids - unum_pids;
| + size = unum_pids * sizeof(pid_t);
| +
| + rc = copy_from_user(&target_pids[j], upids, size);
| + if (rc) {
| + rc = -EFAULT;
| + goto out_free;
| + }
| +
| + return target_pids;
| +
| +out_free:
| + kfree(target_pids);
| + return ERR_PTR(rc);
| +}
| +
| +int
| +fetch_clone_args_from_user(struct clone_args __user *uca, int args_size,
| + struct clone_args *kca)
| +{
| + int rc;
| +
| + /*
| + * TODO: If size of clone_args is not what the kernel expects, it
| + * could be that kernel is newer and has an extended structure.
| + * When that happens, this check needs to be smarter. For now,
| + * assume exact match.
| + */
| + if (args_size != sizeof(struct clone_args))
| + return -EINVAL;
| +
| + rc = copy_from_user(kca, uca, args_size);
| + if (rc)
| + return -EFAULT;
| +
| + /*
| + * To avoid future compatibility issues, ensure unused fields are 0.
| + */
| + if (kca->reserved0 || kca->reserved1 || kca->clone_flags_high)
| + return -EINVAL;
| +
| + return 0;
| +}
| +
| +/*
| * Ok, this is the main fork-routine.
| *
| * It copies the process, and if successful kick-starts
| @@ -1389,7 +1497,7 @@ long do_fork_with_pids(unsigned long clone_flags,
| struct task_struct *p;
| int trace = 0;
| long nr;
| - pid_t *target_pids = NULL;
| + pid_t *target_pids;
|
| /*
| * Do some preliminary argument and permissions checking before we
| @@ -1423,6 +1531,16 @@ long do_fork_with_pids(unsigned long clone_flags,
| }
| }
|
| + target_pids = copy_target_pids(num_pids, upids);
| + if (target_pids) {
| + if (IS_ERR(target_pids))
| + return PTR_ERR(target_pids);
| +
| + nr = -EPERM;
| + if (!capable(CAP_SYS_ADMIN))
| + goto out_free;
| + }
| +
| /*
| * When called from kernel_thread, don't do user tracing stuff.
| */
| @@ -1484,6 +1602,10 @@ long do_fork_with_pids(unsigned long clone_flags,
| } else {
| nr = PTR_ERR(p);
| }
| +
| +out_free:
| + kfree(target_pids);
| +
| return nr;
| }
|
| --
| 1.6.0.4
|
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [v12][PATCH 9/9] Document eclone() syscall
[not found] ` <20091111044527.GI11393@suka>
@ 2009-11-11 22:41 ` Sukadev Bhattiprolu
0 siblings, 0 replies; 10+ messages in thread
From: Sukadev Bhattiprolu @ 2009-11-11 22:41 UTC (permalink / raw)
To: Andrew Morton
Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc,
Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland,
Pavel Emelyanov, linux-kernel
CC: LKML
Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote:
|
| Subject: [v12][PATCH 9/9] Document eclone() syscall
|
| This gives a brief overview of the eclone() system call. We should
| eventually describe more details in existing clone(2) man page or in
| a new man page.
|
| Changelog[v12]:
| - [Serge Hallyn] Fix/simplify stack-setup in the example code
| - [Serge Hallyn, Oren Laadan] Rename syscall to eclone()
|
| Changelog[v11]:
| - [Dave Hansen] Move clone_args validation checks to arch-indpendent
| code.
| - [Oren Laadan] Make args_size a parameter to system call and remove
| it from 'struct clone_args'
| - [Oren Laadan] Fix some typos and clarify the order of pids in the
| @pids parameter.
|
| Changelog[v10]:
| - Rename clone3() to clone_with_pids() and fix some typos.
| - Modify example to show usage with the ptregs implementation.
| Changelog[v9]:
| - [Pavel Machek]: Fix an inconsistency and rename new file to
| Documentation/clone3.
| - [Roland McGrath, H. Peter Anvin] Updates to description and
| example to reflect new prototype of clone3() and the updated/
| renamed 'struct clone_args'.
|
| Changelog[v8]:
| - clone2() is already in use in IA64. Rename syscall to clone3()
| - Add notes to say that we return -EINVAL if invalid clone flags
| are specified or if the reserved fields are not 0.
| Changelog[v7]:
| - Rename clone_with_pids() to clone2()
| - Changes to reflect new prototype of clone2() (using clone_struct).
|
| Signed-off-by: Sukadev Bhattiprolu <sukadev@vnet.linux.ibm.com>
| Acked-by: Oren Laadan <orenl@cs.columbia.edu>
| ---
| Documentation/eclone | 345 ++++++++++++++++++++++++++++++++++++++++++++++++++
| 1 files changed, 345 insertions(+), 0 deletions(-)
| create mode 100644 Documentation/eclone
|
| diff --git a/Documentation/eclone b/Documentation/eclone
| new file mode 100644
| index 0000000..9b4ec84
| --- /dev/null
| +++ b/Documentation/eclone
| @@ -0,0 +1,345 @@
| +
| +struct clone_args {
| + u64 clone_flags_high;
| + u64 child_stack_base;
| + u64 child_stack_size;
| + u64 parent_tid_ptr;
| + u64 child_tid_ptr;
| + u32 nr_pids;
| + u32 reserved0;
| + u64 reserved1;
| +};
| +
| +
| +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
| + pid_t * __user pids)
| +
| + In addition to doing everything that clone() system call does, the
| + eclone() system call:
| +
| + - allows additional clone flags (31 of 32 bits in the flags
| + parameter to clone() are in use)
| +
| + - allows user to specify a pid for the child process in its
| + active and ancestor pid namespaces.
| +
| + This system call is meant to be used when restarting an application
| + from a checkpoint. Such restart requires that the processes in the
| + application have the same pids they had when the application was
| + checkpointed. When containers are nested, the processes within the
| + containers exist in multiple pid namespaces and hence have multiple
| + pids to specify during restart.
| +
| + The @flags_low parameter is identical to the 'clone_flags' parameter
| + in existing clone() system call.
| +
| + The fields in 'struct clone_args' are meant to be used as follows:
| +
| + u64 clone_flags_high:
| +
| + When eclone() supports more than 32 flags, the additional bits
| + in the clone_flags should be specified in this field. This
| + field is currently unused and must be set to 0.
| +
| + u64 child_stack_base;
| + u64 child_stack_size;
| +
| + These two fields correspond to the 'child_stack' fields in
| + clone() and clone2() (on IA64) system calls.
| +
| + u64 parent_tid_ptr;
| + u64 child_tid_ptr;
| +
| + These two fields correspond to the 'parent_tid_ptr' and
| + 'child_tid_ptr' fields in the clone() system call
| +
| + u32 nr_pids;
| +
| + nr_pids specifies the number of pids in the @pids array
| + parameter to eclone() (see below). nr_pids should not exceed
| + the current nesting level of the calling process (i.e if the
| + process is in init_pid_ns, nr_pids must be 1, if process is
| + in a pid namespace that is a child of init-pid-ns, nr_pids
| + cannot exceed 2, and so on).
| +
| + u32 reserved0;
| + u64 reserved1;
| +
| + These fields are intended to extend the functionality of the
| + eclone() in the future, while preserving backward compatibility.
| + They must be set to 0 for now.
| +
| + The @cargs_size parameter specifes the sizeof(struct clone_args) and
| + is intended to enable extending this structure in the future, while
| + preserving backward compatibility. For now, this field must be set
| + to the sizeof(struct clone_args) and this size must match the kernel's
| + view of the structure.
| +
| + The @pids parameter defines the set of pids that should be assigned to
| + the child process in its active and ancestor pid namespaces. The
| + descendant pid namespaces do not matter since a process does not have a
| + pid in descendant namespaces, unless the process is in a new pid
| + namespace in which case the process is a container-init (and must have
| + the pid 1 in that namespace).
| +
| + See CLONE_NEWPID section of clone(2) man page for details about pid
| + namespaces.
| +
| + If a pid in the @pids list is 0, the kernel will assign the next
| + available pid in the pid namespace.
| +
| + If a pid in the @pids list is non-zero, the kernel tries to assign
| + the specified pid in that namespace. If that pid is already in use
| + by another process, the system call fails (see EBUSY below).
| +
| + The order of pids in @pids is oldest in pids[0] to youngest pid
| + namespace in pids[nr_pids-1]. If the number of pids specified in the
| + @pids list is fewer than the nesting level of the process, the pids
| + are applied from youngest namespace. i.e if the process is nested in
| + a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
| + are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
| + have a pid of '0' (the kernel will assign a pid in those namespaces).
| +
| + On success, the system call returns the pid of the child process in
| + the parent's active pid namespace.
| +
| + On failure, eclone() returns -1 and sets 'errno' to one of following
| + values (the child process is not created).
| +
| + EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to
| + specify the pids in this call (if pids are not specifed
| + CAP_SYS_ADMIN is not required).
| +
| + EINVAL The number of pids specified in 'clone_args.nr_pids' exceeds
| + the current nesting level of parent process
| +
| + EINVAL Not all specified clone-flags are valid.
| +
| + EINVAL The reserved fields in the clone_args argument are not 0.
| +
| + EBUSY A requested pid is in use by another process in that namespace.
| +
| +---
| +/*
| + * Example eclone() usage - Create a child process with pid CHILD_TID1 in
| + * the current pid namespace. The child gets the usual "random" pid in any
| + * ancestor pid namespaces.
| + */
| +#include <stdio.h>
| +#include <stdlib.h>
| +#include <string.h>
| +#include <signal.h>
| +#include <errno.h>
| +#include <unistd.h>
| +#include <wait.h>
| +#include <sys/syscall.h>
| +
| +#define __NR_eclone 337
| +#define CLONE_NEWPID 0x20000000
| +#define CLONE_CHILD_SETTID 0x01000000
| +#define CLONE_PARENT_SETTID 0x00100000
| +#define CLONE_UNUSED 0x00001000
| +
| +#define STACKSIZE 8192
| +
| +typedef unsigned long long u64;
| +typedef unsigned int u32;
| +typedef int pid_t;
| +struct clone_args {
| + u64 clone_flags_high;
| +
| + u64 child_stack_base;
| + u64 child_stack_size;
| +
| + u64 parent_tid_ptr;
| + u64 child_tid_ptr;
| +
| + u32 nr_pids;
| +
| + u32 reserved0;
| + u64 reserved1;
| +};
| +
| +#define exit _exit
| +
| +/*
| + * Following eclone() is based on code posted by Oren Laadan at:
| + * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
| + */
| +#if defined(__i386__) && defined(__NR_eclone)
| +
| +int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
| + int *pids)
| +{
| + long retval;
| +
| + __asm__ __volatile__(
| + "movl %0, %%ebx\n\t" /* flags -> 1st (ebx) */
| + "movl %1, %%ecx\n\t" /* clone_args -> 2nd (ecx)*/
| + "movl %2, %%edx\n\t" /* args_size -> 3rd (edx) */
| + "movl %3, %%edi\n\t" /* pids -> 4th (edi)*/
| + "pushl %%ebp\n\t" /* save value of ebp */
| + :
| + :"b" (flags_low),
| + "c" (clone_args),
| + "d" (args_size),
| + "D" (pids)
| + );
| +
| + __asm__ __volatile__(
| + "int $0x80\n\t" /* Linux/i386 system call */
| + "testl %0,%0\n\t" /* check return value */
| + "jne 1f\n\t" /* jump if parent */
| + "popl %%ebx\n\t" /* get subthread function */
| + "call *%%ebx\n\t" /* start subthread function */
| + "movl %2,%0\n\t"
| + "int $0x80\n" /* exit system call: exit subthread */
| + "1:\n\t"
| + "popl %%ebp\t" /* restore parent's ebp */
| + :"=a" (retval)
| + :"0" (__NR_eclone), "i" (__NR_exit)
| + :"ebx", "ecx", "edx"
| + );
| +
| + if (retval < 0) {
| + errno = -retval;
| + retval = -1;
| + }
| + return retval;
| +}
| +
| +/*
| + * Allocate a stack for the clone-child and arrange to have the child
| + * execute @child_fn with @child_arg as the argument.
| + */
| +void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
| +{
| + void *stack_base;
| + void **stack_top;
| + int frame_size;
| +
| + /*
| + * Allocate extra space to push @child_arg and @child_fn near
| + * top of stack
| + */
| + frame_size= 12;
| +
| + stack_base = malloc(size + frame_size);
| + if (!stack_base) {
| + perror("malloc()");
| + exit(1);
| + }
| +
| + stack_top = (void **)((char *)stack_base + (size + frame_size - 4));
| + *--stack_top = child_arg;
| + *--stack_top = child_fn;
| +
| + return stack_base;
| +}
| +#endif
| +
| +/* gettid() is a bit more useful than getpid() when messing with clone() */
| +int gettid()
| +{
| + int rc;
| +
| + rc = syscall(__NR_gettid, 0, 0, 0);
| + if (rc < 0) {
| + printf("rc %d, errno %d\n", rc, errno);
| + exit(1);
| + }
| + return rc;
| +}
| +
| +#define CHILD_TID1 377
| +#define CHILD_TID2 1177
| +#define CHILD_TID3 2799
| +
| +struct clone_args clone_args;
| +void *child_arg = &clone_args;
| +int child_tid;
| +
| +int do_child(void *arg)
| +{
| + struct clone_args *cs = (struct clone_args *)arg;
| + int ctid;
| +
| + /* Verify we pushed the arguments correctly on the stack... */
| + if (arg != child_arg) {
| + printf("Child: Incorrect child arg pointer, expected %p,"
| + "actual %p\n", child_arg, arg);
| + exit(1);
| + }
| +
| + /* ... and that we got the thread-id we expected */
| + ctid = *((int *)cs->child_tid_ptr);
| + if (ctid != CHILD_TID1) {
| + printf("Child: Incorrect child tid, expected %d, actual %d\n",
| + CHILD_TID1, ctid);
| + exit(1);
| + } else {
| + printf("Child got the expected tid, %d\n", gettid());
| + }
| + sleep(3);
| +
| + printf("[%d, %d]: Child exiting\n", getpid(), ctid);
| + exit(0);
| +}
| +
| +static int do_clone(int (*child_fn)(void *), void *child_arg,
| + unsigned int flags_low, int nr_pids, pid_t *pids_list)
| +{
| + int rc;
| + void *stack;
| + struct clone_args *ca = &clone_args;
| + int args_size;
| +
| + stack = setup_stack(child_fn, child_arg, STACKSIZE);
| +
| + memset(ca, 0, sizeof(*ca));
| +
| + ca->child_stack_base = (u64)stack;
| + ca->child_stack_size = (u64)STACKSIZE;
| + ca->child_tid_ptr = (u64)&child_tid;
| + ca->nr_pids = nr_pids;
| +
| + args_size = sizeof(struct clone_args);
| + rc = eclone(flags_low, ca, args_size, pids_list);
| +
| + printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
| + rc, errno);
| + return rc;
| +}
| +
| +/*
| + * Multiple pid_t pid_t values in pids_list[] here are just for illustration.
| + * The test case creates a child in the current pid namespace and uses only
| + * the first value, CHILD_TID1 (note that nr_pids == 1).
| + */
| +pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
| +main()
| +{
| + int rc, pid, ret, status;
| + unsigned long flags;
| + int nr_pids = 1;
| +
| + flags = SIGCHLD|CLONE_CHILD_SETTID;
| +
| + pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
| +
| + printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
| +
| + rc = waitpid(pid, &status, __WALL);
| + if (rc < 0) {
| + printf("waitpid(): rc %d, error %d\n", rc, errno);
| + } else {
| + printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
| + gettid(), rc, status);
| +
| + if (WIFEXITED(status)) {
| + printf("\t EXITED, %d\n", WEXITSTATUS(status));
| + } else if (WIFSIGNALED(status)) {
| + printf("\t SIGNALED, %d\n", WTERMSIG(status));
| + }
| + }
| +}
| --
| 1.6.0.4
|
| _______________________________________________
| Containers mailing list
| Containers@lists.linux-foundation.org
| https://lists.linux-foundation.org/mailman/listinfo/containers
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2009-11-11 22:40 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20091111043440.GA9377@suka>
2009-11-11 22:38 ` [v12][PATCH 0/9] Implement eclone() syscall Sukadev Bhattiprolu
[not found] ` <20091111044250.GA11393@suka>
2009-11-11 22:38 ` [v12][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
[not found] ` <20091111044313.GB11393@suka>
2009-11-11 22:39 ` [v12][PATCH 2/9] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
[not found] ` <20091111044329.GC11393@suka>
2009-11-11 22:39 ` [v12][PATCH 3/9] Define set_pidmap() function Sukadev Bhattiprolu
[not found] ` <20091111044347.GD11393@suka>
2009-11-11 22:39 ` [v12][PATCH 4/9] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
[not found] ` <20091111044403.GE11393@suka>
2009-11-11 22:40 ` [v12][PATCH 5/9] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
[not found] ` <20091111044422.GF11393@suka>
2009-11-11 22:40 ` [v12][PATCH 6/9] Check invalid clone flags Sukadev Bhattiprolu
[not found] ` <20091111044438.GG11393@suka>
2009-11-11 22:40 ` [v12][PATCH 7/9] Define do_fork_with_pids() Sukadev Bhattiprolu
[not found] ` <20091111044509.GH11393@suka>
2009-11-11 22:40 ` [v12][PATCH 8/9] Define eclone() syscall Sukadev Bhattiprolu
[not found] ` <20091111044527.GI11393@suka>
2009-11-11 22:41 ` [v12][PATCH 9/9] Document " Sukadev Bhattiprolu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox