* Re: [v12][PATCH 0/9] Implement eclone() syscall [not found] <20091111043440.GA9377@suka> @ 2009-11-11 22:38 ` Sukadev Bhattiprolu [not found] ` <20091111044250.GA11393@suka> ` (8 subsequent siblings) 9 siblings, 0 replies; 10+ messages in thread From: Sukadev Bhattiprolu @ 2009-11-11 22:38 UTC (permalink / raw) To: Andrew Morton Cc: Oren Laadan, serue, Eric W. Biederman, Alexey Dobriyan, Pavel Emelyanov, hpa, Nathan Lynch, matthltc, arnd, roland, mtk.manpages, linux-api, Containers, linux-kernel Cc: LKML Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote: | | Based on discussions on containers mailing list and IRC, we settled on | the name eclone(). Please let me know of a better name or if there are | other comments on the patchset. | | --- | | Subject: [v12][PATCH 0/9] Implement eclone() syscall | | To support application checkpoint/restart, a task must have the same pid it | had when it was checkpointed. When containers are nested, the tasks within | the containers exist in multiple pid namespaces and hence have multiple pids | to specify during restart. | | This patchset implements a new system call, eclone() that lets a process | specify the pids of the child process. | | Patches 1 through 6 are helper patches needed for choosing a pid for the | child process. | | PATCH 8 implements the eclone() system call on x86. The interface defined in | PATCH 8 has been ported to s390 and ppc64 architectures, but they will be | posted as a separate patchset if this patchset is accepted. | | PATCH 9 adds some documentation on the new system call, some/all of which | will eventually go into a man page. | | Changelog[v12]: | - Ignore ->child_stack_size when ->child_stack_base is NULL (PATCH 8) | - Cleanup/simplify example in Documentation/eclone (PATCH 9). | - Rename sys call to a shorter name, eclone() | | Changelog[v11]: | - [Dave Hansen] Move clone_args validation checks to arch-indpeendent | code. | - [Oren Laadan] Make args_size a parameter to system call and remove | it from 'struct clone_args' | | Changelog[v10]: | - [Linus Torvalds] Use PTREGSCALL() implementation for clone rather | than the generic system call | - Rename clone3() to clone_with_pids() | - Update Documentation/clone_with_pids() to show example usage with | the PTREGSCALL implementation. | | Changelog[v9]: | - [Pavel Emelyanov] Drop the patch that made 'pid_max' a property | of struct pid_namespace | - [Roland McGrath, H. Peter Anvin and earlier on, Serge Hallyn] To | avoid inadvertent truncation clone_flags, preserve the first | parameter of clone3() as 'u32 clone_flags' and specify newer | flags in clone_args.flags_high (PATCH 8/9 and PATCH 9/9) | - [Eric Biederman] Generalize alloc_pidmap() code to simplify and | remove duplication (see PATCH 3/9]. | | Changelog[v8]: | - [Oren Laadan, Louis Rilling, KOSAKI Motohiro] | The name 'clone2()' is in use - renamed new syscall to clone3(). | - [Oren Laadan] ->parent_tidptr and ->child_tidptr need to be 64bit. | - [Oren Laadan] Ensure that unused fields/flags in clone_struct are 0. | (Added [PATCH 7/10] to the patchset). | | Changelog[v7]: | - [Peter Zijlstra, Arnd Bergmann] | Group the arguments to clone2() into a 'struct clone_arg' to | workaround the issue of exceeding 6 arguments to the system call. | Also define clone-flags as u64 to allow additional clone-flags. | | Changelog[v6]: | - [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds] | Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is | constant across architectures (Patches 7, 8). | - (Nathan Lynch) Change pid_set.num_pids to unsigned and remove | 'unum_pids < 0' check (Patches 7,8) | - (Pavel Machek) New patch (Patch 9) to add some documentation. | | Changelog[v5]: | - Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch | into this set) | - (Eric Biederman): Avoid the new function, set_pidmap() - added | couple of checks on 'target_pid' in alloc_pidmap() itself. | | === IMPORTANT NOTE: | | clone() system call has another limitation - all but one bits in clone-flags | are in use and if more new clone-flags are needed, we will need a variant of | the clone() system call. | | It appears to make sense to try and extend this new system call to address | this limitation as well. The requirements of a new clone system call could | then be summarized as: | | - do everything clone() does today, and | - give application an ability to choose pids for the child process | in all ancestor pid namespaces, and | - allow more clone_flags | | Contstraints: | | - system-calls are restricted to 6 parameters and clone() already | takes 5 parameters, any extension to clone() interface would require | one or more copy_from_user(). (Not sure if copy_from_user() of ~40 | bytes would have a significant impact on performance of clone()). | | Based on these requirements and constraints, we explored a couple of system | call interfaces (in earlier versions of this patchset). Based on input from | Arnd Bergmann and others, the new interface of the system call is: | | struct clone_args { | u64 clone_flags_high; | u64 child_stack_base; | u64 child_stack_size; | u64 parent_tid_ptr; | u64 child_tid_ptr; | u32 nr_pids; | u32 reserved0; | u64 reserved1; | }; | | sys_eclone(u32 flags_low, struct clone_args *cargs, int args_size, | pid_t *pids) | | Details of the struct clone_args and the usage are explained in the | documentation (PATCH 9/9). | | NOTE: | While this patchset enables support for more clone-flags, actual | implementation for additional clone-flags is best implemented as | a separate patchset (PATCH 8/9 identifies some TODOs) | | Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <20091111044250.GA11393@suka>]
* Re: [v12][PATCH 1/9] Factor out code to allocate pidmap page [not found] ` <20091111044250.GA11393@suka> @ 2009-11-11 22:38 ` Sukadev Bhattiprolu 0 siblings, 0 replies; 10+ messages in thread From: Sukadev Bhattiprolu @ 2009-11-11 22:38 UTC (permalink / raw) To: Andrew Morton Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc, Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland, Pavel Emelyanov, linux-kernel Cc LKML Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote: | | Subject: [v12][PATCH 1/9] Factor out code to allocate pidmap page | | To simplify alloc_pidmap(), move code to allocate a pid map page to a | separate function. | | Changelog[v3]: | - Earlier version of patchset called alloc_pidmap_page() from two | places. But now its called from only one place. Even so, moving | this code out into a separate function simplifies alloc_pidmap(). | Changelog[v2]: | - (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return | -ENOMEM on error instead of -1. | | Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> | Acked-by: Serge Hallyn <serue@us.ibm.com> | Reviewed-by: Oren Laadan <orenl@cs.columbia.edu> | --- | kernel/pid.c | 45 ++++++++++++++++++++++++++++++--------------- | 1 files changed, 30 insertions(+), 15 deletions(-) | | diff --git a/kernel/pid.c b/kernel/pid.c | index d3f722d..7d4bb6e 100644 | --- a/kernel/pid.c | +++ b/kernel/pid.c | @@ -122,9 +122,35 @@ static void free_pidmap(struct upid *upid) | atomic_inc(&map->nr_free); | } | | +static int alloc_pidmap_page(struct pidmap *map) | +{ | + void *page; | + | + if (likely(map->page)) | + return 0; | + | + page = kzalloc(PAGE_SIZE, GFP_KERNEL); | + | + /* | + * Free the page if someone raced with us installing it: | + */ | + spin_lock_irq(&pidmap_lock); | + if (map->page) | + kfree(page); | + else | + map->page = page; | + spin_unlock_irq(&pidmap_lock); | + | + if (unlikely(!map->page)) | + return -ENOMEM; | + | + return 0; | +} | + | static int alloc_pidmap(struct pid_namespace *pid_ns) | { | int i, offset, max_scan, pid, last = pid_ns->last_pid; | + int rc; | struct pidmap *map; | | pid = last + 1; | @@ -134,21 +160,10 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) | map = &pid_ns->pidmap[pid/BITS_PER_PAGE]; | max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset; | for (i = 0; i <= max_scan; ++i) { | - if (unlikely(!map->page)) { | - void *page = kzalloc(PAGE_SIZE, GFP_KERNEL); | - /* | - * Free the page if someone raced with us | - * installing it: | - */ | - spin_lock_irq(&pidmap_lock); | - if (map->page) | - kfree(page); | - else | - map->page = page; | - spin_unlock_irq(&pidmap_lock); | - if (unlikely(!map->page)) | - break; | - } | + rc = alloc_pidmap_page(map); | + if (rc) | + break; | + | if (likely(atomic_read(&map->nr_free))) { | do { | if (!test_and_set_bit(offset, map->page)) { | -- | 1.6.0.4 | | _______________________________________________ | Containers mailing list | Containers@lists.linux-foundation.org | https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <20091111044313.GB11393@suka>]
* Re: [v12][PATCH 2/9] Have alloc_pidmap() return actual error code [not found] ` <20091111044313.GB11393@suka> @ 2009-11-11 22:39 ` Sukadev Bhattiprolu 0 siblings, 0 replies; 10+ messages in thread From: Sukadev Bhattiprolu @ 2009-11-11 22:39 UTC (permalink / raw) To: Andrew Morton Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc, Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland, Pavel Emelyanov, linux-kernel CC: LKML Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote: | | Subject: [v12][PATCH 2/9] Have alloc_pidmap() return actual error code | | alloc_pidmap() can fail either because all pid numbers are in use or | because memory allocation failed. With support for setting a specific | pid number, alloc_pidmap() would also fail if either the given pid | number is invalid or in use. | | Rather than have callers assume -ENOMEM, have alloc_pidmap() return | the actual error. | | Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> | Acked-by: Serge Hallyn <serue@us.ibm.com> | Reviewed-by: Oren Laadan <orenl@cs.columbia.edu> | --- | kernel/fork.c | 5 +++-- | kernel/pid.c | 14 +++++++++----- | 2 files changed, 12 insertions(+), 7 deletions(-) | | diff --git a/kernel/fork.c b/kernel/fork.c | index 4c20fff..93626b2 100644 | --- a/kernel/fork.c | +++ b/kernel/fork.c | @@ -1156,10 +1156,11 @@ static struct task_struct *copy_process(unsigned long clone_flags, | goto bad_fork_cleanup_io; | | if (pid != &init_struct_pid) { | - retval = -ENOMEM; | pid = alloc_pid(p->nsproxy->pid_ns); | - if (!pid) | + if (IS_ERR(pid)) { | + retval = PTR_ERR(pid); | goto bad_fork_cleanup_io; | + } | | if (clone_flags & CLONE_NEWPID) { | retval = pid_ns_prepare_proc(p->nsproxy->pid_ns); | diff --git a/kernel/pid.c b/kernel/pid.c | index 7d4bb6e..c4d9914 100644 | --- a/kernel/pid.c | +++ b/kernel/pid.c | @@ -150,7 +150,7 @@ static int alloc_pidmap_page(struct pidmap *map) | static int alloc_pidmap(struct pid_namespace *pid_ns) | { | int i, offset, max_scan, pid, last = pid_ns->last_pid; | - int rc; | + int rc = -EAGAIN; | struct pidmap *map; | | pid = last + 1; | @@ -189,12 +189,14 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) | } else { | map = &pid_ns->pidmap[0]; | offset = RESERVED_PIDS; | - if (unlikely(last == offset)) | + if (unlikely(last == offset)) { | + rc = -EAGAIN; | break; | + } | } | pid = mk_pid(pid_ns, map, offset); | } | - return -1; | + return rc; | } | | int next_pidmap(struct pid_namespace *pid_ns, int last) | @@ -263,8 +265,10 @@ struct pid *alloc_pid(struct pid_namespace *ns) | struct upid *upid; | | pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL); | - if (!pid) | + if (!pid) { | + pid = ERR_PTR(-ENOMEM); | goto out; | + } | | tmp = ns; | for (i = ns->level; i >= 0; i--) { | @@ -299,7 +303,7 @@ out_free: | free_pidmap(pid->numbers + i); | | kmem_cache_free(ns->pid_cachep, pid); | - pid = NULL; | + pid = ERR_PTR(nr); | goto out; | } | | -- | 1.6.0.4 | | _______________________________________________ | Containers mailing list | Containers@lists.linux-foundation.org | https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <20091111044329.GC11393@suka>]
* Re: [v12][PATCH 3/9] Define set_pidmap() function [not found] ` <20091111044329.GC11393@suka> @ 2009-11-11 22:39 ` Sukadev Bhattiprolu 0 siblings, 0 replies; 10+ messages in thread From: Sukadev Bhattiprolu @ 2009-11-11 22:39 UTC (permalink / raw) To: Andrew Morton Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc, Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland, Pavel Emelyanov, linux-kernel Cc: LKML Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote: | | Subject: [v12][PATCH 3/9] Define set_pidmap() function | | Define a set_pidmap() interface which is like alloc_pidmap() only that | caller specifies the pid number to be assigned. | | Changelog[v9]: | - Completely rewrote this patch based on Eric Biederman's code. | Changelog[v7]: | - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids. | Changelog[v6]: | - Separate target_pid > 0 case to minimize the number of checks needed. | Changelog[v3]: | - (Eric Biederman): Avoid set_pidmap() function. Added couple of | checks for target_pid in alloc_pidmap() itself. | Changelog[v2]: | - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code | actually checks for 'pid <= 0' for completeness). | | Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> | Acked-by: Oren Laadan <orenl@cs.columbia.edu> | --- | kernel/pid.c | 41 +++++++++++++++++++++++++++++++++-------- | 1 files changed, 33 insertions(+), 8 deletions(-) | | diff --git a/kernel/pid.c b/kernel/pid.c | index c4d9914..afef766 100644 | --- a/kernel/pid.c | +++ b/kernel/pid.c | @@ -147,18 +147,19 @@ static int alloc_pidmap_page(struct pidmap *map) | return 0; | } | | -static int alloc_pidmap(struct pid_namespace *pid_ns) | +static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min, | + int max) | { | - int i, offset, max_scan, pid, last = pid_ns->last_pid; | + int i, offset, max_scan, pid; | int rc = -EAGAIN; | struct pidmap *map; | | pid = last + 1; | if (pid >= pid_max) | - pid = RESERVED_PIDS; | + pid = min; | offset = pid & BITS_PER_PAGE_MASK; | map = &pid_ns->pidmap[pid/BITS_PER_PAGE]; | - max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset; | + max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset; | for (i = 0; i <= max_scan; ++i) { | rc = alloc_pidmap_page(map); | if (rc) | @@ -168,7 +169,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) | do { | if (!test_and_set_bit(offset, map->page)) { | atomic_dec(&map->nr_free); | - pid_ns->last_pid = pid; | return pid; | } | offset = find_next_offset(map, offset); | @@ -179,16 +179,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) | * bitmap block and the final block was the same | * as the starting point, pid is before last_pid. | */ | - } while (offset < BITS_PER_PAGE && pid < pid_max && | + } while (offset < BITS_PER_PAGE && pid < max && | (i != max_scan || pid < last || | !((last+1) & BITS_PER_PAGE_MASK))); | } | - if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) { | + if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) { | ++map; | offset = 0; | } else { | map = &pid_ns->pidmap[0]; | - offset = RESERVED_PIDS; | + offset = min; | if (unlikely(last == offset)) { | rc = -EAGAIN; | break; | @@ -199,6 +199,31 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) | return rc; | } | | +static int alloc_pidmap(struct pid_namespace *pid_ns) | +{ | + int nr; | + | + nr = do_alloc_pidmap(pid_ns, pid_ns->last_pid, RESERVED_PIDS, pid_max); | + if (nr >= 0) | + pid_ns->last_pid = nr; | + return nr; | +} | + | +static int set_pidmap(struct pid_namespace *pid_ns, int target) | +{ | + if (!target) | + return alloc_pidmap(pid_ns); | + | + if (target >= pid_max) | + return -EINVAL; | + | + if ((target < 0) || (target < RESERVED_PIDS && | + pid_ns->last_pid >= RESERVED_PIDS)) | + return -EINVAL; | + | + return do_alloc_pidmap(pid_ns, target - 1, target, target + 1); | +} | + | int next_pidmap(struct pid_namespace *pid_ns, int last) | { | int offset; | -- | 1.6.0.4 | | _______________________________________________ | Containers mailing list | Containers@lists.linux-foundation.org | https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <20091111044347.GD11393@suka>]
* Re: [v12][PATCH 4/9] Add target_pids parameter to alloc_pid() [not found] ` <20091111044347.GD11393@suka> @ 2009-11-11 22:39 ` Sukadev Bhattiprolu 0 siblings, 0 replies; 10+ messages in thread From: Sukadev Bhattiprolu @ 2009-11-11 22:39 UTC (permalink / raw) To: Andrew Morton Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc, Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland, Pavel Emelyanov, linux-kernel Cc: LKML Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote: | | Subject: [v12][PATCH 4/9] Add target_pids parameter to alloc_pid() | | This parameter is currently NULL, but will be used in a follow-on patch. | | Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> | Acked-by: Serge Hallyn <serue@us.ibm.com> | Reviewed-by: Oren Laadan <orenl@cs.columbia.edu> | --- | include/linux/pid.h | 2 +- | kernel/fork.c | 3 ++- | kernel/pid.c | 9 +++++++-- | 3 files changed, 10 insertions(+), 4 deletions(-) | | diff --git a/include/linux/pid.h b/include/linux/pid.h | index 49f1c2f..914185d 100644 | --- a/include/linux/pid.h | +++ b/include/linux/pid.h | @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr); | extern struct pid *find_ge_pid(int nr, struct pid_namespace *); | int next_pidmap(struct pid_namespace *pid_ns, int last); | | -extern struct pid *alloc_pid(struct pid_namespace *ns); | +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids); | extern void free_pid(struct pid *pid); | | /* | diff --git a/kernel/fork.c b/kernel/fork.c | index 93626b2..3f1dddf 100644 | --- a/kernel/fork.c | +++ b/kernel/fork.c | @@ -980,6 +980,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, | int retval; | struct task_struct *p; | int cgroup_callbacks_done = 0; | + pid_t *target_pids = NULL; | | if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) | return ERR_PTR(-EINVAL); | @@ -1156,7 +1157,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, | goto bad_fork_cleanup_io; | | if (pid != &init_struct_pid) { | - pid = alloc_pid(p->nsproxy->pid_ns); | + pid = alloc_pid(p->nsproxy->pid_ns, target_pids); | if (IS_ERR(pid)) { | retval = PTR_ERR(pid); | goto bad_fork_cleanup_io; | diff --git a/kernel/pid.c b/kernel/pid.c | index afef766..3ee52cd 100644 | --- a/kernel/pid.c | +++ b/kernel/pid.c | @@ -281,13 +281,14 @@ void free_pid(struct pid *pid) | call_rcu(&pid->rcu, delayed_put_pid); | } | | -struct pid *alloc_pid(struct pid_namespace *ns) | +struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids) | { | struct pid *pid; | enum pid_type type; | int i, nr; | struct pid_namespace *tmp; | struct upid *upid; | + pid_t tpid; | | pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL); | if (!pid) { | @@ -297,7 +298,11 @@ struct pid *alloc_pid(struct pid_namespace *ns) | | tmp = ns; | for (i = ns->level; i >= 0; i--) { | - nr = alloc_pidmap(tmp); | + tpid = 0; | + if (target_pids) | + tpid = target_pids[i]; | + | + nr = set_pidmap(tmp, tpid); | if (nr < 0) | goto out_free; | | -- | 1.6.0.4 | | _______________________________________________ | Containers mailing list | Containers@lists.linux-foundation.org | https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <20091111044403.GE11393@suka>]
* Re: [v12][PATCH 5/9] Add target_pids parameter to copy_process() [not found] ` <20091111044403.GE11393@suka> @ 2009-11-11 22:40 ` Sukadev Bhattiprolu 0 siblings, 0 replies; 10+ messages in thread From: Sukadev Bhattiprolu @ 2009-11-11 22:40 UTC (permalink / raw) To: Andrew Morton Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc, Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland, Pavel Emelyanov, linux-kernel CC: LKML Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote: | | Subject: [v12][PATCH 5/9] Add target_pids parameter to copy_process() | | Add a 'target_pids' parameter to copy_process(). The new parameter will be | used in a follow-on patch when eclone() is implemented. | | Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> | Acked-by: Serge Hallyn <serue@us.ibm.com> | Reviewed-by: Oren Laadan <orenl@cs.columbia.edu> | --- | kernel/fork.c | 7 ++++--- | 1 files changed, 4 insertions(+), 3 deletions(-) | | diff --git a/kernel/fork.c b/kernel/fork.c | index 3f1dddf..c8a06de 100644 | --- a/kernel/fork.c | +++ b/kernel/fork.c | @@ -975,12 +975,12 @@ static struct task_struct *copy_process(unsigned long clone_flags, | unsigned long stack_size, | int __user *child_tidptr, | struct pid *pid, | + pid_t *target_pids, | int trace) | { | int retval; | struct task_struct *p; | int cgroup_callbacks_done = 0; | - pid_t *target_pids = NULL; | | if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) | return ERR_PTR(-EINVAL); | @@ -1361,7 +1361,7 @@ struct task_struct * __cpuinit fork_idle(int cpu) | struct pt_regs regs; | | task = copy_process(CLONE_VM, 0, idle_regs(®s), 0, NULL, | - &init_struct_pid, 0); | + &init_struct_pid, NULL, 0); | if (!IS_ERR(task)) | init_idle(task, cpu); | | @@ -1384,6 +1384,7 @@ long do_fork(unsigned long clone_flags, | struct task_struct *p; | int trace = 0; | long nr; | + pid_t *target_pids = NULL; | | /* | * Do some preliminary argument and permissions checking before we | @@ -1424,7 +1425,7 @@ long do_fork(unsigned long clone_flags, | trace = tracehook_prepare_clone(clone_flags); | | p = copy_process(clone_flags, stack_start, regs, stack_size, | - child_tidptr, NULL, trace); | + child_tidptr, NULL, target_pids, trace); | /* | * Do this prior waking up the new thread - the thread pointer | * might get invalid after that point, if the thread exits quickly. | -- | 1.6.0.4 | | _______________________________________________ | Containers mailing list | Containers@lists.linux-foundation.org | https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <20091111044422.GF11393@suka>]
* Re: [v12][PATCH 6/9] Check invalid clone flags [not found] ` <20091111044422.GF11393@suka> @ 2009-11-11 22:40 ` Sukadev Bhattiprolu 0 siblings, 0 replies; 10+ messages in thread From: Sukadev Bhattiprolu @ 2009-11-11 22:40 UTC (permalink / raw) To: Andrew Morton Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc, Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland, Pavel Emelyanov, linux-kernel Cc: LKML Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote: | | Subject: [v12][PATCH 6/9] Check invalid clone flags | | As pointed out by Oren Laadan, we want to ensure that unused bits in the | clone-flags remain unused and available for future. To ensure this, define | a mask of clone-flags and check the flags in the clone() system calls. | | Changelog[v9]: | - Include the unused clone-flag (CLONE_UNUSED) to VALID_CLONE_FLAGS | to avoid breaking any applications that may have set it. IOW, this | patch/check only applies to clone-flags bits 33 and higher. | | Changelog[v8]: | - New patch in set | | Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> | Acked-by: Oren Laadan <orenl@cs.columbia.edu> | --- | include/linux/sched.h | 12 ++++++++++++ | kernel/fork.c | 3 +++ | 2 files changed, 15 insertions(+), 0 deletions(-) | | diff --git a/include/linux/sched.h b/include/linux/sched.h | index 75e6e60..a4d2c23 100644 | --- a/include/linux/sched.h | +++ b/include/linux/sched.h | @@ -29,6 +29,18 @@ | #define CLONE_NEWNET 0x40000000 /* New network namespace */ | #define CLONE_IO 0x80000000 /* Clone io context */ | | +#define CLONE_UNUSED 0x00001000 /* Can be reused ? */ | + | +#define VALID_CLONE_FLAGS (CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\ | + CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\ | + CLONE_VFORK | CLONE_PARENT | CLONE_THREAD |\ | + CLONE_NEWNS | CLONE_SYSVSEM | CLONE_SETTLS |\ | + CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |\ | + CLONE_DETACHED | CLONE_UNTRACED |\ | + CLONE_CHILD_SETTID | CLONE_STOPPED |\ | + CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\ | + CLONE_NEWPID | CLONE_NEWNET | CLONE_IO) | + | /* | * Scheduling policies | */ | diff --git a/kernel/fork.c b/kernel/fork.c | index c8a06de..11f77ed 100644 | --- a/kernel/fork.c | +++ b/kernel/fork.c | @@ -982,6 +982,9 @@ static struct task_struct *copy_process(unsigned long clone_flags, | struct task_struct *p; | int cgroup_callbacks_done = 0; | | + if (clone_flags & ~VALID_CLONE_FLAGS) | + return ERR_PTR(-EINVAL); | + | if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) | return ERR_PTR(-EINVAL); | | -- | 1.6.0.4 | | _______________________________________________ | Containers mailing list | Containers@lists.linux-foundation.org | https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <20091111044438.GG11393@suka>]
* Re: [v12][PATCH 7/9] Define do_fork_with_pids() [not found] ` <20091111044438.GG11393@suka> @ 2009-11-11 22:40 ` Sukadev Bhattiprolu 0 siblings, 0 replies; 10+ messages in thread From: Sukadev Bhattiprolu @ 2009-11-11 22:40 UTC (permalink / raw) To: Andrew Morton Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc, Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland, Pavel Emelyanov, linux-kernel cc: lkml Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote: | | Subject: [v12][PATCH 7/9] Define do_fork_with_pids() | | do_fork_with_pids() is same as do_fork(), except that it takes an | additional, 'pid_set', parameter. This parameter, currently unused, | specifies the set of target pids of the process in each of its pid | namespaces. | | Changelog[v7]: | - Drop 'struct pid_set' object and pass in 'pid_t *target_pids' | instead of 'struct pid_set *'. | | Changelog[v6]: | - (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds) | Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set' | is constant across architectures. | - (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'. | | Changelog[v4]: | - Rename 'struct target_pid_set' to 'struct pid_set' since it may | be useful in other contexts. | | Changelog[v3]: | - Fix "long-line" warning from checkpatch.pl | | Changelog[v2]: | - To facilitate moving architecture-inpdendent code to kernel/fork.c | pass in 'struct target_pid_set __user *' to do_fork_with_pids() | rather than 'pid_t *' (next patch moves the arch-independent | code to kernel/fork.c) | | Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> | Acked-by: Serge Hallyn <serue@us.ibm.com> | Reviewed-by: Oren Laadan <orenl@cs.columbia.edu> | --- | include/linux/sched.h | 3 +++ | kernel/fork.c | 17 +++++++++++++++-- | 2 files changed, 18 insertions(+), 2 deletions(-) | | diff --git a/include/linux/sched.h b/include/linux/sched.h | index a4d2c23..85e971a 100644 | --- a/include/linux/sched.h | +++ b/include/linux/sched.h | @@ -2153,6 +2153,9 @@ extern int disallow_signal(int); | | extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *); | extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *); | +extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, | + unsigned long, int __user *, int __user *, | + unsigned int, pid_t __user *); | struct task_struct *fork_idle(int); | | extern void set_task_comm(struct task_struct *tsk, char *from); | diff --git a/kernel/fork.c b/kernel/fork.c | index 11f77ed..210e841 100644 | --- a/kernel/fork.c | +++ b/kernel/fork.c | @@ -1377,12 +1377,14 @@ struct task_struct * __cpuinit fork_idle(int cpu) | * It copies the process, and if successful kick-starts | * it and waits for it to finish using the VM if required. | */ | -long do_fork(unsigned long clone_flags, | +long do_fork_with_pids(unsigned long clone_flags, | unsigned long stack_start, | struct pt_regs *regs, | unsigned long stack_size, | int __user *parent_tidptr, | - int __user *child_tidptr) | + int __user *child_tidptr, | + unsigned int num_pids, | + pid_t __user *upids) | { | struct task_struct *p; | int trace = 0; | @@ -1485,6 +1487,17 @@ long do_fork(unsigned long clone_flags, | return nr; | } | | +long do_fork(unsigned long clone_flags, | + unsigned long stack_start, | + struct pt_regs *regs, | + unsigned long stack_size, | + int __user *parent_tidptr, | + int __user *child_tidptr) | +{ | + return do_fork_with_pids(clone_flags, stack_start, regs, stack_size, | + parent_tidptr, child_tidptr, 0, NULL); | +} | + | #ifndef ARCH_MIN_MMSTRUCT_ALIGN | #define ARCH_MIN_MMSTRUCT_ALIGN 0 | #endif | -- | 1.6.0.4 | | _______________________________________________ | Containers mailing list | Containers@lists.linux-foundation.org | https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <20091111044509.GH11393@suka>]
* Re: [v12][PATCH 8/9] Define eclone() syscall [not found] ` <20091111044509.GH11393@suka> @ 2009-11-11 22:40 ` Sukadev Bhattiprolu 0 siblings, 0 replies; 10+ messages in thread From: Sukadev Bhattiprolu @ 2009-11-11 22:40 UTC (permalink / raw) To: Andrew Morton Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc, Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland, Pavel Emelyanov, linux-kernel cc: lkml Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote: | | Subject: [v12][PATCH 8/9] Define eclone() syscall | | Container restart requires that a task have the same pid it had when it was | checkpointed. When containers are nested the tasks within the containers | exist in multiple pid namespaces and hence have multiple pids to specify | during restart. | | eclone(), intended for use during restart, is the same as | clone(), except that it takes a 'pids' paramter. This parameter lets | caller choose specific pid numbers for the child process, in the | process's active and ancestor pid namespaces. (Descendant pid namespaces | in general don't matter since processes don't have pids in them anyway, | but see comments in copy_target_pids() regarding CLONE_NEWPID). | | eclone() also attempts to address a second limitation of the | clone() system call. clone() is restricted to 32 clone flags and all but | one of these are in use. If more new clone flags are needed, we will be | forced to define a new variant of the clone() system call. To address | this, eclone() allows at least 64 clone flags with some room | for more if necessary. | | To prevent unprivileged processes from misusing this interface, | eclone() currently needs CAP_SYS_ADMIN, when the 'pids' parameter | is non-NULL. | | See Documentation/eclone in next patch for more details and an | example of its usage. | | NOTE: | - System calls are restricted to 6 parameters and the number and sizes | of parameters needed for eclone() exceed 6 integers. The new | prototype works around this restriction while providing some | flexibility if eclone() needs to be further extended in the | future. | TODO: | - We should convert clone-flags to 64-bit value in all architectures. | Its probably best to do that as a separate patchset since clone_flags | touches several functions and that patchset seems independent of this | new system call. | | Changelog[v12]: | - [Serge Hallyn] Ignore ->child_stack_size if ->child_stack_base | is NULL. | - [Oren Laadan, Serge Hallyn] Rename clone_with_pids() to eclone() | Changelog[v11]: | - [Dave Hansen] Move clone_args validation checks to arch-indpeendent | code. | - [Oren Laadan] Make args_size a parameter to system call and remove | it from 'struct clone_args' | | Changelog[v10]: | - Rename clone3() to clone_with_pids() | - [Linus Torvalds] Use PTREGSCALL() rather than the generic syscall | implementation | | Changelog[v9]: | - [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit | architectures split the new clone-flags into 'low' and 'high' | words and pass in the 'lower' flags as the first argument. | This would maintain similarity of the clone3() with clone()/ | clone2(). Also has the side-effect of the name matching the | number of parameters :-) | - [Roland McGrath] Rename structure to 'clone_args' and add a | 'child_stack_size' field | | Changelog[v8] | - [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg' | must be 64-bit. | - clone2() is in use in IA64. Rename system call to clone3(). | | Changelog[v7]: | - [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2() | and group parameters into a new 'struct clone_struct' object. | | Changelog[v6]: | - (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds) | Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set' | is constant across architectures. | - (Nathan Lynch) Change pid_set.num_pids to unsigned and remove | 'unum_pids < 0' check. | | Changelog[v4]: | - (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set' | | Changelog[v3]: | - (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid | in the target_pids[] list and setting it 0. See copy_target_pids()). | - (Oren Laadan) Specified target pids should apply only to youngest | pid-namespaces (see copy_target_pids()) | - (Matt Helsley) Update patch description. | | Changelog[v2]: | - Remove unnecessary printk and add a note to callers of | copy_target_pids() to free target_pids. | - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description. | - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and | 'num_pids == 0' (fall back to normal clone()). | - Move arch-independent code (sanity checks and copy-in of target-pids) | into kernel/fork.c and simplify sys_clone_with_pids() | | Changelog[v1]: | - Fixed some compile errors (had fixed these errors earlier in my | git tree but had not refreshed patches before emailing them) | | Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> | Acked-by: Oren Laadan <orenl@cs.columbia.edu> | --- | arch/x86/include/asm/syscalls.h | 1 + | arch/x86/include/asm/unistd_32.h | 1 + | arch/x86/kernel/entry_32.S | 1 + | arch/x86/kernel/process_32.c | 47 ++++++++++++++ | arch/x86/kernel/syscall_table_32.S | 1 + | include/linux/sched.h | 2 + | include/linux/types.h | 11 +++ | kernel/fork.c | 124 +++++++++++++++++++++++++++++++++++- | 8 files changed, 187 insertions(+), 1 deletions(-) | | diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h | index 372b76e..2cadb8e 100644 | --- a/arch/x86/include/asm/syscalls.h | +++ b/arch/x86/include/asm/syscalls.h | @@ -40,6 +40,7 @@ long sys_iopl(struct pt_regs *); | | /* kernel/process_32.c */ | int sys_clone(struct pt_regs *); | +int sys_eclone(struct pt_regs *); | int sys_execve(struct pt_regs *); | | /* kernel/signal.c */ | diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h | index 6fb3c20..030b121 100644 | --- a/arch/x86/include/asm/unistd_32.h | +++ b/arch/x86/include/asm/unistd_32.h | @@ -342,6 +342,7 @@ | #define __NR_pwritev 334 | #define __NR_rt_tgsigqueueinfo 335 | #define __NR_perf_event_open 336 | +#define __NR_eclone 337 | | #ifdef __KERNEL__ | | diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S | index c097e7d..7e7f3c8 100644 | --- a/arch/x86/kernel/entry_32.S | +++ b/arch/x86/kernel/entry_32.S | @@ -718,6 +718,7 @@ ptregs_##name: \ | PTREGSCALL(iopl) | PTREGSCALL(fork) | PTREGSCALL(clone) | +PTREGSCALL(eclone) | PTREGSCALL(vfork) | PTREGSCALL(execve) | PTREGSCALL(sigaltstack) | diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c | index 4cf7956..be7c1a1 100644 | --- a/arch/x86/kernel/process_32.c | +++ b/arch/x86/kernel/process_32.c | @@ -445,6 +445,53 @@ int sys_clone(struct pt_regs *regs) | return do_fork(clone_flags, newsp, regs, 0, parent_tidptr, child_tidptr); | } | | +int sys_eclone(struct pt_regs *regs) | +{ | + int rc; | + int args_size; | + struct clone_args kca; | + unsigned long flags; | + int __user *parent_tid_ptr; | + int __user *child_tid_ptr; | + unsigned long __user child_stack; | + unsigned long stack_size; | + unsigned int flags_low; | + struct clone_args __user *uca; | + pid_t __user *pids; | + | + flags_low = regs->bx; | + uca = (struct clone_args __user *)regs->cx; | + args_size = regs->dx; | + pids = (pid_t __user *)regs->di; | + | + rc = fetch_clone_args_from_user(uca, args_size, &kca); | + if (rc) | + return rc; | + | + /* | + * TODO: Convert 'clone-flags' to 64-bits on all architectures. | + * TODO: When ->clone_flags_high is non-zero, copy it in to the | + * higher word(s) of 'flags': | + * | + * flags = (kca.clone_flags_high << 32) | flags_low; | + */ | + flags = flags_low; | + parent_tid_ptr = (int *)kca.parent_tid_ptr; | + child_tid_ptr = (int *)kca.child_tid_ptr; | + stack_size = (unsigned long)kca.child_stack_size; | + | + child_stack = 0UL; | + if (kca.child_stack_base) | + child_stack = (unsigned long)kca.child_stack_base + stack_size; | + | + if (!child_stack) | + child_stack = regs->sp; | + | + return do_fork_with_pids(flags, child_stack, regs, stack_size, | + parent_tid_ptr, child_tid_ptr, kca.nr_pids, | + pids); | +} | + | /* | * sys_execve() executes a new program. | */ | diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S | index 0157cd2..2b6394c 100644 | --- a/arch/x86/kernel/syscall_table_32.S | +++ b/arch/x86/kernel/syscall_table_32.S | @@ -336,3 +336,4 @@ ENTRY(sys_call_table) | .long sys_pwritev | .long sys_rt_tgsigqueueinfo /* 335 */ | .long sys_perf_event_open | + .long ptregs_eclone | diff --git a/include/linux/sched.h b/include/linux/sched.h | index 85e971a..b2a881d 100644 | --- a/include/linux/sched.h | +++ b/include/linux/sched.h | @@ -2153,6 +2153,8 @@ extern int disallow_signal(int); | | extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *); | extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *); | +extern int fetch_clone_args_from_user(struct clone_args __user *, int, | + struct clone_args *); | extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, | unsigned long, int __user *, int __user *, | unsigned int, pid_t __user *); | diff --git a/include/linux/types.h b/include/linux/types.h | index c42724f..0d08683 100644 | --- a/include/linux/types.h | +++ b/include/linux/types.h | @@ -204,6 +204,17 @@ struct ustat { | char f_fpack[6]; | }; | | +struct clone_args { | + u64 clone_flags_high; | + u64 child_stack_base; | + u64 child_stack_size; | + u64 parent_tid_ptr; | + u64 child_tid_ptr; | + u32 nr_pids; | + u32 reserved0; | + u64 reserved1; | +}; | + | #endif /* __KERNEL__ */ | #endif /* __ASSEMBLY__ */ | #endif /* _LINUX_TYPES_H */ | diff --git a/kernel/fork.c b/kernel/fork.c | index 210e841..cacc26b 100644 | --- a/kernel/fork.c | +++ b/kernel/fork.c | @@ -1372,6 +1372,114 @@ struct task_struct * __cpuinit fork_idle(int cpu) | } | | /* | + * If user specified any 'target-pids' in @upid_setp, copy them from | + * user and return a pointer to a local copy of the list of pids. The | + * caller must free the list, when they are done using it. | + * | + * If user did not specify any target pids, return NULL (caller should | + * treat this like normal clone). | + * | + * On any errors, return the error code | + */ | +static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids) | +{ | + int j; | + int rc; | + int size; | + int knum_pids; /* # of pids needed in kernel */ | + pid_t *target_pids; | + | + if (!unum_pids) | + return NULL; | + | + knum_pids = task_pid(current)->level + 1; | + if (unum_pids > knum_pids) | + return ERR_PTR(-EINVAL); | + | + /* | + * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[] | + * and set it to 0. This last entry in target_pids[] corresponds to the | + * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was | + * specified. If CLONE_NEWPID was not specified, this last entry will | + * simply be ignored. | + */ | + target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL); | + if (!target_pids) | + return ERR_PTR(-ENOMEM); | + | + /* | + * A process running in a level 2 pid namespace has three pid namespaces | + * and hence three pid numbers. If this process is checkpointed, | + * information about these three namespaces are saved. We refer to these | + * namespaces as 'known namespaces'. | + * | + * If this checkpointed process is however restarted in a level 3 pid | + * namespace, the restarted process has an extra ancestor pid namespace | + * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'. | + * | + * During restart, the process requests specific pids for its 'known | + * namespaces' and lets kernel assign pids to its 'unknown namespaces'. | + * | + * Since the requested-pids correspond to 'known namespaces' and since | + * 'known-namespaces' are younger than (i.e descendants of) 'unknown- | + * namespaces', copy requested pids to the back-end of target_pids[] | + * (i.e before the last entry for CLONE_NEWPID mentioned above). | + * Any entries in target_pids[] not corresponding to a requested pid | + * will be set to zero and kernel assigns a pid in those namespaces. | + * | + * NOTE: The order of pids in target_pids[] is oldest pid namespace to | + * youngest (target_pids[0] corresponds to init_pid_ns). i.e. | + * the order is: | + * | + * - pids for 'unknown-namespaces' (if any) | + * - pids for 'known-namespaces' (requested pids) | + * - 0 in the last entry (for CLONE_NEWPID). | + */ | + j = knum_pids - unum_pids; | + size = unum_pids * sizeof(pid_t); | + | + rc = copy_from_user(&target_pids[j], upids, size); | + if (rc) { | + rc = -EFAULT; | + goto out_free; | + } | + | + return target_pids; | + | +out_free: | + kfree(target_pids); | + return ERR_PTR(rc); | +} | + | +int | +fetch_clone_args_from_user(struct clone_args __user *uca, int args_size, | + struct clone_args *kca) | +{ | + int rc; | + | + /* | + * TODO: If size of clone_args is not what the kernel expects, it | + * could be that kernel is newer and has an extended structure. | + * When that happens, this check needs to be smarter. For now, | + * assume exact match. | + */ | + if (args_size != sizeof(struct clone_args)) | + return -EINVAL; | + | + rc = copy_from_user(kca, uca, args_size); | + if (rc) | + return -EFAULT; | + | + /* | + * To avoid future compatibility issues, ensure unused fields are 0. | + */ | + if (kca->reserved0 || kca->reserved1 || kca->clone_flags_high) | + return -EINVAL; | + | + return 0; | +} | + | +/* | * Ok, this is the main fork-routine. | * | * It copies the process, and if successful kick-starts | @@ -1389,7 +1497,7 @@ long do_fork_with_pids(unsigned long clone_flags, | struct task_struct *p; | int trace = 0; | long nr; | - pid_t *target_pids = NULL; | + pid_t *target_pids; | | /* | * Do some preliminary argument and permissions checking before we | @@ -1423,6 +1531,16 @@ long do_fork_with_pids(unsigned long clone_flags, | } | } | | + target_pids = copy_target_pids(num_pids, upids); | + if (target_pids) { | + if (IS_ERR(target_pids)) | + return PTR_ERR(target_pids); | + | + nr = -EPERM; | + if (!capable(CAP_SYS_ADMIN)) | + goto out_free; | + } | + | /* | * When called from kernel_thread, don't do user tracing stuff. | */ | @@ -1484,6 +1602,10 @@ long do_fork_with_pids(unsigned long clone_flags, | } else { | nr = PTR_ERR(p); | } | + | +out_free: | + kfree(target_pids); | + | return nr; | } | | -- | 1.6.0.4 | | _______________________________________________ | Containers mailing list | Containers@lists.linux-foundation.org | https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 10+ messages in thread
[parent not found: <20091111044527.GI11393@suka>]
* Re: [v12][PATCH 9/9] Document eclone() syscall [not found] ` <20091111044527.GI11393@suka> @ 2009-11-11 22:41 ` Sukadev Bhattiprolu 0 siblings, 0 replies; 10+ messages in thread From: Sukadev Bhattiprolu @ 2009-11-11 22:41 UTC (permalink / raw) To: Andrew Morton Cc: mtk.manpages, arnd, Containers, Nathan Lynch, matthltc, Eric W. Biederman, hpa, linux-api, Alexey Dobriyan, roland, Pavel Emelyanov, linux-kernel CC: LKML Sukadev Bhattiprolu [sukadev@linux.vnet.ibm.com] wrote: | | Subject: [v12][PATCH 9/9] Document eclone() syscall | | This gives a brief overview of the eclone() system call. We should | eventually describe more details in existing clone(2) man page or in | a new man page. | | Changelog[v12]: | - [Serge Hallyn] Fix/simplify stack-setup in the example code | - [Serge Hallyn, Oren Laadan] Rename syscall to eclone() | | Changelog[v11]: | - [Dave Hansen] Move clone_args validation checks to arch-indpendent | code. | - [Oren Laadan] Make args_size a parameter to system call and remove | it from 'struct clone_args' | - [Oren Laadan] Fix some typos and clarify the order of pids in the | @pids parameter. | | Changelog[v10]: | - Rename clone3() to clone_with_pids() and fix some typos. | - Modify example to show usage with the ptregs implementation. | Changelog[v9]: | - [Pavel Machek]: Fix an inconsistency and rename new file to | Documentation/clone3. | - [Roland McGrath, H. Peter Anvin] Updates to description and | example to reflect new prototype of clone3() and the updated/ | renamed 'struct clone_args'. | | Changelog[v8]: | - clone2() is already in use in IA64. Rename syscall to clone3() | - Add notes to say that we return -EINVAL if invalid clone flags | are specified or if the reserved fields are not 0. | Changelog[v7]: | - Rename clone_with_pids() to clone2() | - Changes to reflect new prototype of clone2() (using clone_struct). | | Signed-off-by: Sukadev Bhattiprolu <sukadev@vnet.linux.ibm.com> | Acked-by: Oren Laadan <orenl@cs.columbia.edu> | --- | Documentation/eclone | 345 ++++++++++++++++++++++++++++++++++++++++++++++++++ | 1 files changed, 345 insertions(+), 0 deletions(-) | create mode 100644 Documentation/eclone | | diff --git a/Documentation/eclone b/Documentation/eclone | new file mode 100644 | index 0000000..9b4ec84 | --- /dev/null | +++ b/Documentation/eclone | @@ -0,0 +1,345 @@ | + | +struct clone_args { | + u64 clone_flags_high; | + u64 child_stack_base; | + u64 child_stack_size; | + u64 parent_tid_ptr; | + u64 child_tid_ptr; | + u32 nr_pids; | + u32 reserved0; | + u64 reserved1; | +}; | + | + | +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size, | + pid_t * __user pids) | + | + In addition to doing everything that clone() system call does, the | + eclone() system call: | + | + - allows additional clone flags (31 of 32 bits in the flags | + parameter to clone() are in use) | + | + - allows user to specify a pid for the child process in its | + active and ancestor pid namespaces. | + | + This system call is meant to be used when restarting an application | + from a checkpoint. Such restart requires that the processes in the | + application have the same pids they had when the application was | + checkpointed. When containers are nested, the processes within the | + containers exist in multiple pid namespaces and hence have multiple | + pids to specify during restart. | + | + The @flags_low parameter is identical to the 'clone_flags' parameter | + in existing clone() system call. | + | + The fields in 'struct clone_args' are meant to be used as follows: | + | + u64 clone_flags_high: | + | + When eclone() supports more than 32 flags, the additional bits | + in the clone_flags should be specified in this field. This | + field is currently unused and must be set to 0. | + | + u64 child_stack_base; | + u64 child_stack_size; | + | + These two fields correspond to the 'child_stack' fields in | + clone() and clone2() (on IA64) system calls. | + | + u64 parent_tid_ptr; | + u64 child_tid_ptr; | + | + These two fields correspond to the 'parent_tid_ptr' and | + 'child_tid_ptr' fields in the clone() system call | + | + u32 nr_pids; | + | + nr_pids specifies the number of pids in the @pids array | + parameter to eclone() (see below). nr_pids should not exceed | + the current nesting level of the calling process (i.e if the | + process is in init_pid_ns, nr_pids must be 1, if process is | + in a pid namespace that is a child of init-pid-ns, nr_pids | + cannot exceed 2, and so on). | + | + u32 reserved0; | + u64 reserved1; | + | + These fields are intended to extend the functionality of the | + eclone() in the future, while preserving backward compatibility. | + They must be set to 0 for now. | + | + The @cargs_size parameter specifes the sizeof(struct clone_args) and | + is intended to enable extending this structure in the future, while | + preserving backward compatibility. For now, this field must be set | + to the sizeof(struct clone_args) and this size must match the kernel's | + view of the structure. | + | + The @pids parameter defines the set of pids that should be assigned to | + the child process in its active and ancestor pid namespaces. The | + descendant pid namespaces do not matter since a process does not have a | + pid in descendant namespaces, unless the process is in a new pid | + namespace in which case the process is a container-init (and must have | + the pid 1 in that namespace). | + | + See CLONE_NEWPID section of clone(2) man page for details about pid | + namespaces. | + | + If a pid in the @pids list is 0, the kernel will assign the next | + available pid in the pid namespace. | + | + If a pid in the @pids list is non-zero, the kernel tries to assign | + the specified pid in that namespace. If that pid is already in use | + by another process, the system call fails (see EBUSY below). | + | + The order of pids in @pids is oldest in pids[0] to youngest pid | + namespace in pids[nr_pids-1]. If the number of pids specified in the | + @pids list is fewer than the nesting level of the process, the pids | + are applied from youngest namespace. i.e if the process is nested in | + a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids | + are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to | + have a pid of '0' (the kernel will assign a pid in those namespaces). | + | + On success, the system call returns the pid of the child process in | + the parent's active pid namespace. | + | + On failure, eclone() returns -1 and sets 'errno' to one of following | + values (the child process is not created). | + | + EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to | + specify the pids in this call (if pids are not specifed | + CAP_SYS_ADMIN is not required). | + | + EINVAL The number of pids specified in 'clone_args.nr_pids' exceeds | + the current nesting level of parent process | + | + EINVAL Not all specified clone-flags are valid. | + | + EINVAL The reserved fields in the clone_args argument are not 0. | + | + EBUSY A requested pid is in use by another process in that namespace. | + | +--- | +/* | + * Example eclone() usage - Create a child process with pid CHILD_TID1 in | + * the current pid namespace. The child gets the usual "random" pid in any | + * ancestor pid namespaces. | + */ | +#include <stdio.h> | +#include <stdlib.h> | +#include <string.h> | +#include <signal.h> | +#include <errno.h> | +#include <unistd.h> | +#include <wait.h> | +#include <sys/syscall.h> | + | +#define __NR_eclone 337 | +#define CLONE_NEWPID 0x20000000 | +#define CLONE_CHILD_SETTID 0x01000000 | +#define CLONE_PARENT_SETTID 0x00100000 | +#define CLONE_UNUSED 0x00001000 | + | +#define STACKSIZE 8192 | + | +typedef unsigned long long u64; | +typedef unsigned int u32; | +typedef int pid_t; | +struct clone_args { | + u64 clone_flags_high; | + | + u64 child_stack_base; | + u64 child_stack_size; | + | + u64 parent_tid_ptr; | + u64 child_tid_ptr; | + | + u32 nr_pids; | + | + u32 reserved0; | + u64 reserved1; | +}; | + | +#define exit _exit | + | +/* | + * Following eclone() is based on code posted by Oren Laadan at: | + * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html | + */ | +#if defined(__i386__) && defined(__NR_eclone) | + | +int eclone(u32 flags_low, struct clone_args *clone_args, int args_size, | + int *pids) | +{ | + long retval; | + | + __asm__ __volatile__( | + "movl %0, %%ebx\n\t" /* flags -> 1st (ebx) */ | + "movl %1, %%ecx\n\t" /* clone_args -> 2nd (ecx)*/ | + "movl %2, %%edx\n\t" /* args_size -> 3rd (edx) */ | + "movl %3, %%edi\n\t" /* pids -> 4th (edi)*/ | + "pushl %%ebp\n\t" /* save value of ebp */ | + : | + :"b" (flags_low), | + "c" (clone_args), | + "d" (args_size), | + "D" (pids) | + ); | + | + __asm__ __volatile__( | + "int $0x80\n\t" /* Linux/i386 system call */ | + "testl %0,%0\n\t" /* check return value */ | + "jne 1f\n\t" /* jump if parent */ | + "popl %%ebx\n\t" /* get subthread function */ | + "call *%%ebx\n\t" /* start subthread function */ | + "movl %2,%0\n\t" | + "int $0x80\n" /* exit system call: exit subthread */ | + "1:\n\t" | + "popl %%ebp\t" /* restore parent's ebp */ | + :"=a" (retval) | + :"0" (__NR_eclone), "i" (__NR_exit) | + :"ebx", "ecx", "edx" | + ); | + | + if (retval < 0) { | + errno = -retval; | + retval = -1; | + } | + return retval; | +} | + | +/* | + * Allocate a stack for the clone-child and arrange to have the child | + * execute @child_fn with @child_arg as the argument. | + */ | +void *setup_stack(int (*child_fn)(void *), void *child_arg, int size) | +{ | + void *stack_base; | + void **stack_top; | + int frame_size; | + | + /* | + * Allocate extra space to push @child_arg and @child_fn near | + * top of stack | + */ | + frame_size= 12; | + | + stack_base = malloc(size + frame_size); | + if (!stack_base) { | + perror("malloc()"); | + exit(1); | + } | + | + stack_top = (void **)((char *)stack_base + (size + frame_size - 4)); | + *--stack_top = child_arg; | + *--stack_top = child_fn; | + | + return stack_base; | +} | +#endif | + | +/* gettid() is a bit more useful than getpid() when messing with clone() */ | +int gettid() | +{ | + int rc; | + | + rc = syscall(__NR_gettid, 0, 0, 0); | + if (rc < 0) { | + printf("rc %d, errno %d\n", rc, errno); | + exit(1); | + } | + return rc; | +} | + | +#define CHILD_TID1 377 | +#define CHILD_TID2 1177 | +#define CHILD_TID3 2799 | + | +struct clone_args clone_args; | +void *child_arg = &clone_args; | +int child_tid; | + | +int do_child(void *arg) | +{ | + struct clone_args *cs = (struct clone_args *)arg; | + int ctid; | + | + /* Verify we pushed the arguments correctly on the stack... */ | + if (arg != child_arg) { | + printf("Child: Incorrect child arg pointer, expected %p," | + "actual %p\n", child_arg, arg); | + exit(1); | + } | + | + /* ... and that we got the thread-id we expected */ | + ctid = *((int *)cs->child_tid_ptr); | + if (ctid != CHILD_TID1) { | + printf("Child: Incorrect child tid, expected %d, actual %d\n", | + CHILD_TID1, ctid); | + exit(1); | + } else { | + printf("Child got the expected tid, %d\n", gettid()); | + } | + sleep(3); | + | + printf("[%d, %d]: Child exiting\n", getpid(), ctid); | + exit(0); | +} | + | +static int do_clone(int (*child_fn)(void *), void *child_arg, | + unsigned int flags_low, int nr_pids, pid_t *pids_list) | +{ | + int rc; | + void *stack; | + struct clone_args *ca = &clone_args; | + int args_size; | + | + stack = setup_stack(child_fn, child_arg, STACKSIZE); | + | + memset(ca, 0, sizeof(*ca)); | + | + ca->child_stack_base = (u64)stack; | + ca->child_stack_size = (u64)STACKSIZE; | + ca->child_tid_ptr = (u64)&child_tid; | + ca->nr_pids = nr_pids; | + | + args_size = sizeof(struct clone_args); | + rc = eclone(flags_low, ca, args_size, pids_list); | + | + printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(), | + rc, errno); | + return rc; | +} | + | +/* | + * Multiple pid_t pid_t values in pids_list[] here are just for illustration. | + * The test case creates a child in the current pid namespace and uses only | + * the first value, CHILD_TID1 (note that nr_pids == 1). | + */ | +pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 }; | +main() | +{ | + int rc, pid, ret, status; | + unsigned long flags; | + int nr_pids = 1; | + | + flags = SIGCHLD|CLONE_CHILD_SETTID; | + | + pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list); | + | + printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid); | + | + rc = waitpid(pid, &status, __WALL); | + if (rc < 0) { | + printf("waitpid(): rc %d, error %d\n", rc, errno); | + } else { | + printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(), | + gettid(), rc, status); | + | + if (WIFEXITED(status)) { | + printf("\t EXITED, %d\n", WEXITSTATUS(status)); | + } else if (WIFSIGNALED(status)) { | + printf("\t SIGNALED, %d\n", WTERMSIG(status)); | + } | + } | +} | -- | 1.6.0.4 | | _______________________________________________ | Containers mailing list | Containers@lists.linux-foundation.org | https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2009-11-11 22:40 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20091111043440.GA9377@suka>
2009-11-11 22:38 ` [v12][PATCH 0/9] Implement eclone() syscall Sukadev Bhattiprolu
[not found] ` <20091111044250.GA11393@suka>
2009-11-11 22:38 ` [v12][PATCH 1/9] Factor out code to allocate pidmap page Sukadev Bhattiprolu
[not found] ` <20091111044313.GB11393@suka>
2009-11-11 22:39 ` [v12][PATCH 2/9] Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
[not found] ` <20091111044329.GC11393@suka>
2009-11-11 22:39 ` [v12][PATCH 3/9] Define set_pidmap() function Sukadev Bhattiprolu
[not found] ` <20091111044347.GD11393@suka>
2009-11-11 22:39 ` [v12][PATCH 4/9] Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
[not found] ` <20091111044403.GE11393@suka>
2009-11-11 22:40 ` [v12][PATCH 5/9] Add target_pids parameter to copy_process() Sukadev Bhattiprolu
[not found] ` <20091111044422.GF11393@suka>
2009-11-11 22:40 ` [v12][PATCH 6/9] Check invalid clone flags Sukadev Bhattiprolu
[not found] ` <20091111044438.GG11393@suka>
2009-11-11 22:40 ` [v12][PATCH 7/9] Define do_fork_with_pids() Sukadev Bhattiprolu
[not found] ` <20091111044509.GH11393@suka>
2009-11-11 22:40 ` [v12][PATCH 8/9] Define eclone() syscall Sukadev Bhattiprolu
[not found] ` <20091111044527.GI11393@suka>
2009-11-11 22:41 ` [v12][PATCH 9/9] Document " Sukadev Bhattiprolu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox