* [PATCH 00/11][v15]: Implement eclone() system call
@ 2010-07-03 20:32 Sukadev Bhattiprolu
[not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
0 siblings, 1 reply; 31+ messages in thread
From: Sukadev Bhattiprolu @ 2010-07-03 20:32 UTC (permalink / raw)
To: Oren Laadan, Serge Hallyn
Cc: Randy Dunlap, Arnd Bergmann, Albert Cahalan, Containers,
Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath
To support application checkpoint/restart, a task must have the same pid it
had when it was checkpointed. When containers are nested, the tasks within
the containers exist in multiple pid namespaces and hence have multiple pids
to specify during restart.
This patchset implements a new system call, eclone() that lets a process
specify the pids of the child process.
Summary:
Patches 1 through 7 are helper patches needed for choosing a pid
for the child process.
Patches 8 through 10 implement the eclone() system call on x86,
x86_64, s390 and powerpc.
Patch 11 documents the new system call, some/all of which will
eventually go into a man page.
Changelog[v15]:
- [Albert Cahalan, Randy Dunlap]: Specify stack as [base, offset]
on all architectures rather than [base, offset] on a few and
stack pointer on others.
- [Randy Dunlap] Fix typos in documentation and pointer to usage
examples of eclone()
Changelog[v14]:
- Updates to documentaiton
Changelog[v13]:
- Implement sys_eclone() on x86_64, s390 and powerpc architectures
- Reorg x86 implementation to enable sharing code with x86_64
- [Arnd Bergmann] Remove the ->reserved1 field we now have args_size
- [Nathan Lynch, Serge Hallyn]: Rename ->child_stack_base to
->child_stack and ensure ->child_stack_size is 0 on architectures
that don't need the stack size.
- Modify exmaple in Documentation to avoid unnecessary register copy.
Changelog[v12]:
- Ignore ->child_stack_size when ->child_stack_base is NULL (PATCH 8)
- Cleanup/simplify example in Documentation/eclone (PATCH 9).
- Rename sys call to a shorter name, eclone()
Changelog[v11]:
- [Dave Hansen] Move clone_args validation checks to arch-indpeendent
code.
- [Oren Laadan] Make args_size a parameter to system call and remove
it from 'struct clone_args'
Changelog[v10]:
- [Linus Torvalds] Use PTREGSCALL() implementation for clone rather
than the generic system call
- Rename clone3() to clone_with_pids()
- Update Documentation/clone_with_pids() to show example usage with
the PTREGSCALL implementation.
Changelog[v9]:
- [Pavel Emelyanov] Drop the patch that made 'pid_max' a property
of struct pid_namespace
- [Roland McGrath, H. Peter Anvin and earlier on, Serge Hallyn] To
avoid inadvertent truncation clone_flags, preserve the first
parameter of clone3() as 'u32 clone_flags' and specify newer
flags in clone_args.flags_high (PATCH 8/9 and PATCH 9/9)
- [Eric Biederman] Generalize alloc_pidmap() code to simplify and
remove duplication (see PATCH 3/9].
Changelog[v8]:
- [Oren Laadan, Louis Rilling, KOSAKI Motohiro]
The name 'clone2()' is in use - renamed new syscall to clone3().
- [Oren Laadan] ->parent_tidptr and ->child_tidptr need to be 64bit.
- [Oren Laadan] Ensure that unused fields/flags in clone_struct are 0.
(Added [PATCH 7/10] to the patchset).
Changelog[v7]:
- [Peter Zijlstra, Arnd Bergmann]
Group the arguments to clone2() into a 'struct clone_arg' to
workaround the issue of exceeding 6 arguments to the system call.
Also define clone-flags as u64 to allow additional clone-flags.
Changelog[v6]:
- [Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds]
Change 'pid_set.pids' to 'pid_t pids[]' so sizeof(struct pid_set) is
constant across architectures (Patches 7, 8).
- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
'unum_pids < 0' check (Patches 7,8)
- (Pavel Machek) New patch (Patch 9) to add some documentation.
Changelog[v5]:
- Make 'pid_max' a property of pid_ns (Integrated Serge Hallyn's patch
into this set)
- (Eric Biederman): Avoid the new function, set_pidmap() - added
couple of checks on 'target_pid' in alloc_pidmap() itself.
=== IMPORTANT NOTE:
clone() system call has another limitation - all but one bits in clone-flags
are in use and if more new clone-flags are needed, we will need a variant of
the clone() system call.
It appears to make sense to try and extend this new system call to address
this limitation as well. The requirements of a new clone system call could
then be summarized as:
- do everything clone() does today, and
- give application an ability to choose pids for the child process
in all ancestor pid namespaces, and
- allow more clone_flags
Contstraints:
- system-calls are restricted to 6 parameters and clone() already
takes 5 parameters, any extension to clone() interface would require
one or more copy_from_user(). (Not sure if copy_from_user() of ~40
bytes would have a significant impact on performance of clone()).
Based on these requirements and constraints, we explored a couple of system
call interfaces (in earlier versions of this patchset). Based on input from
Arnd Bergmann and others, the new interface of the system call is:
struct clone_args {
u64 clone_flags_high;
u64 child_stack_base;
u64 child_stack_size;
u64 parent_tid_ptr;
u64 child_tid_ptr;
u32 nr_pids;
u32 reserved0;
};
sys_eclone(u32 flags_low, struct clone_args *cargs, int args_size,
pid_t *pids)
Details of the struct clone_args and the usage are explained in the
documentation (PATCH 11/11).
NOTE:
While this patchset enables support for more clone-flags, actual
implementation for additional clone-flags is best implemented as
a separate patchset (PATCH 8/9 identifies some TODOs)
Nathan Lynch (1):
eclone (10/11): Implement sys_eclone for powerpc
Serge E. Hallyn (1):
eclone (9/11): Implement sys_eclone for s390
Sukadev Bhattiprolu (9):
eclone (1/11): Factor out code to allocate pidmap page
eclone (2/11): Have alloc_pidmap() return actual error code
eclone (3/11): Define set_pidmap() function
eclone (4/11): Add target_pids parameter to alloc_pid()
eclone (5/11): Add target_pids parameter to copy_process()
eclone (6/11): Check invalid clone flags
eclone (7/11): Define do_fork_with_pids()
eclone (8/11): Implement sys_eclone for x86 (32,64)
eclone (11/11): Document sys_eclone
Documentation/eclone | 354 +++++++++++++++++++++++++++++++++++
arch/powerpc/include/asm/syscalls.h | 6 +
arch/powerpc/include/asm/systbl.h | 1 +
arch/powerpc/include/asm/unistd.h | 3 +-
arch/powerpc/kernel/entry_32.S | 8 +
arch/powerpc/kernel/entry_64.S | 5 +
arch/powerpc/kernel/process.c | 62 ++++++-
arch/s390/include/asm/unistd.h | 3 +-
arch/s390/kernel/compat_linux.c | 17 ++
arch/s390/kernel/compat_wrapper.S | 8 +
arch/s390/kernel/process.c | 39 ++++
arch/s390/kernel/syscalls.S | 1 +
arch/x86/ia32/ia32entry.S | 2 +
arch/x86/include/asm/syscalls.h | 2 +
arch/x86/include/asm/unistd_32.h | 3 +-
arch/x86/include/asm/unistd_64.h | 2 +
arch/x86/kernel/entry_32.S | 14 ++
arch/x86/kernel/entry_64.S | 1 +
arch/x86/kernel/process.c | 43 ++++-
arch/x86/kernel/syscall_table_32.S | 1 +
include/linux/pid.h | 2 +-
include/linux/sched.h | 17 ++
include/linux/types.h | 10 +
kernel/fork.c | 155 +++++++++++++++-
kernel/pid.c | 101 +++++++---
25 files changed, 818 insertions(+), 42 deletions(-)
create mode 100644 Documentation/eclone
Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
^ permalink raw reply [flat|nested] 31+ messages in thread[parent not found: <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* [PATCH 01/11][v15]: Factor out code to allocate pidmap page [not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2010-07-03 20:32 ` Sukadev Bhattiprolu 2010-07-03 20:32 ` [PATCH 02/11][v15]: Have alloc_pidmap() return actual error code Sukadev Bhattiprolu ` (9 subsequent siblings) 10 siblings, 0 replies; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-03 20:32 UTC (permalink / raw) To: Oren Laadan, Serge Hallyn Cc: Randy Dunlap, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath To simplify alloc_pidmap(), move code to allocate a pid map page to a separate function. Changelog[v4]: - [Oren Laadan] Adapt to kernel 2.6.33-rc5 Changelog[v3]: - Earlier version of patchset called alloc_pidmap_page() from two places. But now its called from only one place. Even so, moving this code out into a separate function simplifies alloc_pidmap(). Changelog[v2]: - (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return -ENOMEM on error instead of -1. Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linuxppc-dev-mnsaURCQ41sdnm+yROfE0A@public.gmane.org Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- kernel/pid.c | 41 ++++++++++++++++++++++++++--------------- 1 files changed, 26 insertions(+), 15 deletions(-) diff --git a/kernel/pid.c b/kernel/pid.c index aebb30d..52a371a 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -122,6 +122,30 @@ static void free_pidmap(struct upid *upid) atomic_inc(&map->nr_free); } +static int alloc_pidmap_page(struct pidmap *map) +{ + void *page; + + if (likely(map->page)) + return 0; + + page = kzalloc(PAGE_SIZE, GFP_KERNEL); + /* + * Free the page if someone raced with us installing it: + */ + spin_lock_irq(&pidmap_lock); + if (!map->page) { + map->page = page; + page = NULL; + } + spin_unlock_irq(&pidmap_lock); + kfree(page); + if (unlikely(!map->page)) + return -1; + + return 0; +} + static int alloc_pidmap(struct pid_namespace *pid_ns) { int i, offset, max_scan, pid, last = pid_ns->last_pid; @@ -134,22 +158,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) map = &pid_ns->pidmap[pid/BITS_PER_PAGE]; max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset; for (i = 0; i <= max_scan; ++i) { - if (unlikely(!map->page)) { - void *page = kzalloc(PAGE_SIZE, GFP_KERNEL); - /* - * Free the page if someone raced with us - * installing it: - */ - spin_lock_irq(&pidmap_lock); - if (!map->page) { - map->page = page; - page = NULL; - } - spin_unlock_irq(&pidmap_lock); - kfree(page); - if (unlikely(!map->page)) + if (unlikely(!map->page)) + if (alloc_pidmap_page(map) < 0) break; - } if (likely(atomic_read(&map->nr_free))) { do { if (!test_and_set_bit(offset, map->page)) { -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 02/11][v15]: Have alloc_pidmap() return actual error code [not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 2010-07-03 20:32 ` [PATCH 01/11][v15]: Factor out code to allocate pidmap page Sukadev Bhattiprolu @ 2010-07-03 20:32 ` Sukadev Bhattiprolu 2010-07-03 20:32 ` [PATCH 03/11][v15]: Define set_pidmap() function Sukadev Bhattiprolu ` (8 subsequent siblings) 10 siblings, 0 replies; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-03 20:32 UTC (permalink / raw) To: Oren Laadan, Serge Hallyn Cc: Randy Dunlap, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath alloc_pidmap() can fail either because all pid numbers are in use or because memory allocation failed. With support for setting a specific pid number, alloc_pidmap() would also fail if either the given pid number is invalid or in use. Rather than have callers assume -ENOMEM, have alloc_pidmap() return the actual error. Changelog[v1]: - [Oren Laadan] Rebase to kernel 2.6.33 Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linuxppc-dev-mnsaURCQ41sdnm+yROfE0A@public.gmane.org Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- kernel/fork.c | 5 +++-- kernel/pid.c | 10 ++++++---- 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 44b0791..afdfb08 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1147,10 +1147,11 @@ static struct task_struct *copy_process(unsigned long clone_flags, goto bad_fork_cleanup_io; if (pid != &init_struct_pid) { - retval = -ENOMEM; pid = alloc_pid(p->nsproxy->pid_ns); - if (!pid) + if (IS_ERR(pid)) { + retval = PTR_ERR(pid); goto bad_fork_cleanup_io; + } if (clone_flags & CLONE_NEWPID) { retval = pid_ns_prepare_proc(p->nsproxy->pid_ns); diff --git a/kernel/pid.c b/kernel/pid.c index 52a371a..8330488 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -160,7 +160,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) for (i = 0; i <= max_scan; ++i) { if (unlikely(!map->page)) if (alloc_pidmap_page(map) < 0) - break; + return -ENOMEM; if (likely(atomic_read(&map->nr_free))) { do { if (!test_and_set_bit(offset, map->page)) { @@ -191,7 +191,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) } pid = mk_pid(pid_ns, map, offset); } - return -1; + return -EBUSY; } int next_pidmap(struct pid_namespace *pid_ns, int last) @@ -260,8 +260,10 @@ struct pid *alloc_pid(struct pid_namespace *ns) struct upid *upid; pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL); - if (!pid) + if (!pid) { + pid = ERR_PTR(-ENOMEM); goto out; + } tmp = ns; for (i = ns->level; i >= 0; i--) { @@ -295,7 +297,7 @@ out_free: free_pidmap(pid->numbers + i); kmem_cache_free(ns->pid_cachep, pid); - pid = NULL; + pid = ERR_PTR(nr); goto out; } -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 03/11][v15]: Define set_pidmap() function [not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 2010-07-03 20:32 ` [PATCH 01/11][v15]: Factor out code to allocate pidmap page Sukadev Bhattiprolu 2010-07-03 20:32 ` [PATCH 02/11][v15]: Have alloc_pidmap() return actual error code Sukadev Bhattiprolu @ 2010-07-03 20:32 ` Sukadev Bhattiprolu 2010-07-03 20:32 ` [PATCH 04/11][v15]: Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu ` (7 subsequent siblings) 10 siblings, 0 replies; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-03 20:32 UTC (permalink / raw) To: Oren Laadan, Serge Hallyn Cc: Randy Dunlap, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath Define a set_pidmap() interface which is like alloc_pidmap() only that caller specifies the pid number to be assigned. Changelog[v13]: - Don't let do_alloc_pidmap return 0 if it failed to find a pid. Changelog[v9]: - Completely rewrote this patch based on Eric Biederman's code. Changelog[v7]: - [Eric Biederman] Generalize alloc_pidmap() to take a range of pids. Changelog[v6]: - Separate target_pid > 0 case to minimize the number of checks needed. Changelog[v3]: - (Eric Biederman): Avoid set_pidmap() function. Added couple of checks for target_pid in alloc_pidmap() itself. Changelog[v2]: - (Serge Hallyn) Check for 'pid < 0' in set_pidmap().(Code actually checks for 'pid <= 0' for completeness). Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linuxppc-dev-mnsaURCQ41sdnm+yROfE0A@public.gmane.org Signed-off-by: Sukadev Bhattiprolu <sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- kernel/pid.c | 41 +++++++++++++++++++++++++++++++++-------- 1 files changed, 33 insertions(+), 8 deletions(-) diff --git a/kernel/pid.c b/kernel/pid.c index 8330488..4eaf975 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -146,17 +146,18 @@ static int alloc_pidmap_page(struct pidmap *map) return 0; } -static int alloc_pidmap(struct pid_namespace *pid_ns) +static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min, + int max) { - int i, offset, max_scan, pid, last = pid_ns->last_pid; + int i, offset, max_scan, pid; struct pidmap *map; pid = last + 1; if (pid >= pid_max) - pid = RESERVED_PIDS; + pid = min; offset = pid & BITS_PER_PAGE_MASK; map = &pid_ns->pidmap[pid/BITS_PER_PAGE]; - max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset; + max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset; for (i = 0; i <= max_scan; ++i) { if (unlikely(!map->page)) if (alloc_pidmap_page(map) < 0) @@ -165,7 +166,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) do { if (!test_and_set_bit(offset, map->page)) { atomic_dec(&map->nr_free); - pid_ns->last_pid = pid; return pid; } offset = find_next_offset(map, offset); @@ -176,16 +176,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) * bitmap block and the final block was the same * as the starting point, pid is before last_pid. */ - } while (offset < BITS_PER_PAGE && pid < pid_max && + } while (offset < BITS_PER_PAGE && pid < max && (i != max_scan || pid < last || !((last+1) & BITS_PER_PAGE_MASK))); } - if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) { + if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) { ++map; offset = 0; } else { map = &pid_ns->pidmap[0]; - offset = RESERVED_PIDS; + offset = min; if (unlikely(last == offset)) break; } @@ -194,6 +194,31 @@ static int alloc_pidmap(struct pid_namespace *pid_ns) return -EBUSY; } +static int alloc_pidmap(struct pid_namespace *pid_ns) +{ + int nr; + + nr = do_alloc_pidmap(pid_ns, pid_ns->last_pid, RESERVED_PIDS, pid_max); + if (nr >= 0) + pid_ns->last_pid = nr; + return nr; +} + +static int set_pidmap(struct pid_namespace *pid_ns, int target) +{ + if (!target) + return alloc_pidmap(pid_ns); + + if (target >= pid_max) + return -EINVAL; + + if ((target < 0) || (target < RESERVED_PIDS && + pid_ns->last_pid >= RESERVED_PIDS)) + return -EINVAL; + + return do_alloc_pidmap(pid_ns, target - 1, target, target + 1); +} + int next_pidmap(struct pid_namespace *pid_ns, int last) { int offset; -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 04/11][v15]: Add target_pids parameter to alloc_pid() [not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> ` (2 preceding siblings ...) 2010-07-03 20:32 ` [PATCH 03/11][v15]: Define set_pidmap() function Sukadev Bhattiprolu @ 2010-07-03 20:32 ` Sukadev Bhattiprolu 2010-07-03 20:32 ` [PATCH 05/11][v15]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu ` (6 subsequent siblings) 10 siblings, 0 replies; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-03 20:32 UTC (permalink / raw) To: Oren Laadan, Serge Hallyn Cc: Randy Dunlap, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath This parameter is currently NULL, but will be used in a follow-on patch. Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linuxppc-dev-mnsaURCQ41sdnm+yROfE0A@public.gmane.org Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- include/linux/pid.h | 2 +- kernel/fork.c | 3 ++- kernel/pid.c | 9 +++++++-- 3 files changed, 10 insertions(+), 4 deletions(-) diff --git a/include/linux/pid.h b/include/linux/pid.h index 49f1c2f..914185d 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr); extern struct pid *find_ge_pid(int nr, struct pid_namespace *); int next_pidmap(struct pid_namespace *pid_ns, int last); -extern struct pid *alloc_pid(struct pid_namespace *ns); +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids); extern void free_pid(struct pid *pid); /* diff --git a/kernel/fork.c b/kernel/fork.c index afdfb08..62018c8 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -962,6 +962,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, int retval; struct task_struct *p; int cgroup_callbacks_done = 0; + pid_t *target_pids = NULL; if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) return ERR_PTR(-EINVAL); @@ -1147,7 +1148,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, goto bad_fork_cleanup_io; if (pid != &init_struct_pid) { - pid = alloc_pid(p->nsproxy->pid_ns); + pid = alloc_pid(p->nsproxy->pid_ns, target_pids); if (IS_ERR(pid)) { retval = PTR_ERR(pid); goto bad_fork_cleanup_io; diff --git a/kernel/pid.c b/kernel/pid.c index 4eaf975..57f1344 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -276,13 +276,14 @@ void free_pid(struct pid *pid) call_rcu(&pid->rcu, delayed_put_pid); } -struct pid *alloc_pid(struct pid_namespace *ns) +struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids) { struct pid *pid; enum pid_type type; int i, nr; struct pid_namespace *tmp; struct upid *upid; + pid_t tpid; pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL); if (!pid) { @@ -292,7 +293,11 @@ struct pid *alloc_pid(struct pid_namespace *ns) tmp = ns; for (i = ns->level; i >= 0; i--) { - nr = alloc_pidmap(tmp); + tpid = 0; + if (target_pids) + tpid = target_pids[i]; + + nr = set_pidmap(tmp, tpid); if (nr < 0) goto out_free; -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 05/11][v15]: Add target_pids parameter to copy_process() [not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> ` (3 preceding siblings ...) 2010-07-03 20:32 ` [PATCH 04/11][v15]: Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu @ 2010-07-03 20:32 ` Sukadev Bhattiprolu 2010-07-03 20:32 ` [PATCH 06/11][v15]: Check invalid clone flags Sukadev Bhattiprolu ` (5 subsequent siblings) 10 siblings, 0 replies; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-03 20:32 UTC (permalink / raw) To: Oren Laadan, Serge Hallyn Cc: Randy Dunlap, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath Add a 'target_pids' parameter to copy_process(). The new parameter will be used in a follow-on patch when eclone() is implemented. Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linuxppc-dev-mnsaURCQ41sdnm+yROfE0A@public.gmane.org Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- kernel/fork.c | 7 ++++--- 1 files changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 62018c8..9d2b57e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -957,12 +957,12 @@ static struct task_struct *copy_process(unsigned long clone_flags, unsigned long stack_size, int __user *child_tidptr, struct pid *pid, + pid_t *target_pids, int trace) { int retval; struct task_struct *p; int cgroup_callbacks_done = 0; - pid_t *target_pids = NULL; if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) return ERR_PTR(-EINVAL); @@ -1339,7 +1339,7 @@ struct task_struct * __cpuinit fork_idle(int cpu) struct pt_regs regs; task = copy_process(CLONE_VM, 0, idle_regs(®s), 0, NULL, - &init_struct_pid, 0); + &init_struct_pid, NULL, 0); if (!IS_ERR(task)) init_idle(task, cpu); @@ -1362,6 +1362,7 @@ long do_fork(unsigned long clone_flags, struct task_struct *p; int trace = 0; long nr; + pid_t *target_pids = NULL; /* * Do some preliminary argument and permissions checking before we @@ -1402,7 +1403,7 @@ long do_fork(unsigned long clone_flags, trace = tracehook_prepare_clone(clone_flags); p = copy_process(clone_flags, stack_start, regs, stack_size, - child_tidptr, NULL, trace); + child_tidptr, NULL, target_pids, trace); /* * Do this prior waking up the new thread - the thread pointer * might get invalid after that point, if the thread exits quickly. -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 06/11][v15]: Check invalid clone flags [not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> ` (4 preceding siblings ...) 2010-07-03 20:32 ` [PATCH 05/11][v15]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu @ 2010-07-03 20:32 ` Sukadev Bhattiprolu 2010-07-03 20:32 ` [PATCH 07/11][v15]: Define do_fork_with_pids() Sukadev Bhattiprolu ` (4 subsequent siblings) 10 siblings, 0 replies; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-03 20:32 UTC (permalink / raw) To: Oren Laadan, Serge Hallyn Cc: Randy Dunlap, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath As pointed out by Oren Laadan, we want to ensure that unused bits in the clone-flags remain unused and available for future. To ensure this, define a mask of clone-flags and check the flags in the clone() system calls. Changelog[v9]: - Include the unused clone-flag (CLONE_UNUSED) to VALID_CLONE_FLAGS to avoid breaking any applications that may have set it. IOW, this patch/check only applies to clone-flags bits 33 and higher. Changelog[v8]: - New patch in set Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linuxppc-dev-mnsaURCQ41sdnm+yROfE0A@public.gmane.org Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Acked-by: Oren Laadan <orenl.cs.columbia.edu> --- include/linux/sched.h | 12 ++++++++++++ kernel/fork.c | 3 +++ 2 files changed, 15 insertions(+), 0 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index dad7f66..5de3ce5 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -29,6 +29,18 @@ #define CLONE_NEWNET 0x40000000 /* New network namespace */ #define CLONE_IO 0x80000000 /* Clone io context */ +#define CLONE_UNUSED 0x00001000 /* Can be reused ? */ + +#define VALID_CLONE_FLAGS (CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\ + CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\ + CLONE_VFORK | CLONE_PARENT | CLONE_THREAD |\ + CLONE_NEWNS | CLONE_SYSVSEM | CLONE_SETTLS |\ + CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID |\ + CLONE_DETACHED | CLONE_UNTRACED |\ + CLONE_CHILD_SETTID | CLONE_STOPPED |\ + CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\ + CLONE_NEWPID | CLONE_NEWNET | CLONE_IO) + /* * Scheduling policies */ diff --git a/kernel/fork.c b/kernel/fork.c index 9d2b57e..e41b3d1 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -964,6 +964,9 @@ static struct task_struct *copy_process(unsigned long clone_flags, struct task_struct *p; int cgroup_callbacks_done = 0; + if (clone_flags & ~VALID_CLONE_FLAGS) + return ERR_PTR(-EINVAL); + if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS)) return ERR_PTR(-EINVAL); -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 07/11][v15]: Define do_fork_with_pids() [not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> ` (5 preceding siblings ...) 2010-07-03 20:32 ` [PATCH 06/11][v15]: Check invalid clone flags Sukadev Bhattiprolu @ 2010-07-03 20:32 ` Sukadev Bhattiprolu 2010-07-03 20:32 ` [PATCH 08/11][v15]: Implement sys_eclone for x86 (32,64) Sukadev Bhattiprolu ` (3 subsequent siblings) 10 siblings, 0 replies; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-03 20:32 UTC (permalink / raw) To: Oren Laadan, Serge Hallyn Cc: Randy Dunlap, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath do_fork_with_pids() is same as do_fork(), except that it takes an additional, 'pid_set', parameter. This parameter, currently unused, specifies the set of target pids of the process in each of its pid namespaces. Changelog[v7]: - Drop 'struct pid_set' object and pass in 'pid_t *target_pids' instead of 'struct pid_set *'. Changelog[v6]: - (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds) Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set' is constant across architectures. - (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'. Changelog[v4]: - Rename 'struct target_pid_set' to 'struct pid_set' since it may be useful in other contexts. Changelog[v3]: - Fix "long-line" warning from checkpatch.pl Changelog[v2]: - To facilitate moving architecture-inpdendent code to kernel/fork.c pass in 'struct target_pid_set __user *' to do_fork_with_pids() rather than 'pid_t *' (next patch moves the arch-independent code to kernel/fork.c) Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linuxppc-dev-mnsaURCQ41sdnm+yROfE0A@public.gmane.org Cc: Oleg Nesterov <oleg-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Reviewed-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- include/linux/sched.h | 3 +++ kernel/fork.c | 17 +++++++++++++++-- 2 files changed, 18 insertions(+), 2 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 5de3ce5..f4ae3e3 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2129,6 +2129,9 @@ extern int disallow_signal(int); extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *); extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *); +extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, + unsigned long, int __user *, int __user *, + unsigned int, pid_t __user *); struct task_struct *fork_idle(int); extern void set_task_comm(struct task_struct *tsk, char *from); diff --git a/kernel/fork.c b/kernel/fork.c index e41b3d1..2559d7a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1355,12 +1355,14 @@ struct task_struct * __cpuinit fork_idle(int cpu) * It copies the process, and if successful kick-starts * it and waits for it to finish using the VM if required. */ -long do_fork(unsigned long clone_flags, +long do_fork_with_pids(unsigned long clone_flags, unsigned long stack_start, struct pt_regs *regs, unsigned long stack_size, int __user *parent_tidptr, - int __user *child_tidptr) + int __user *child_tidptr, + unsigned int num_pids, + pid_t __user *upids) { struct task_struct *p; int trace = 0; @@ -1463,6 +1465,17 @@ long do_fork(unsigned long clone_flags, return nr; } +long do_fork(unsigned long clone_flags, + unsigned long stack_start, + struct pt_regs *regs, + unsigned long stack_size, + int __user *parent_tidptr, + int __user *child_tidptr) +{ + return do_fork_with_pids(clone_flags, stack_start, regs, stack_size, + parent_tidptr, child_tidptr, 0, NULL); +} + #ifndef ARCH_MIN_MMSTRUCT_ALIGN #define ARCH_MIN_MMSTRUCT_ALIGN 0 #endif -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 08/11][v15]: Implement sys_eclone for x86 (32,64) [not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> ` (6 preceding siblings ...) 2010-07-03 20:32 ` [PATCH 07/11][v15]: Define do_fork_with_pids() Sukadev Bhattiprolu @ 2010-07-03 20:32 ` Sukadev Bhattiprolu 2010-07-03 20:32 ` [PATCH 09/11][v15]: Implement sys_eclone for s390 Sukadev Bhattiprolu ` (2 subsequent siblings) 10 siblings, 0 replies; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-03 20:32 UTC (permalink / raw) To: Oren Laadan, Serge Hallyn Cc: Randy Dunlap, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath Container restart requires that a task have the same pid it had when it was checkpointed. When containers are nested the tasks within the containers exist in multiple pid namespaces and hence have multiple pids to specify during restart. eclone(), intended for use during restart, is the same as clone(), except that it takes a 'pids' paramter. This parameter lets caller choose specific pid numbers for the child process, in the process's active and ancestor pid namespaces. (Descendant pid namespaces in general don't matter since processes don't have pids in them anyway, but see comments in copy_target_pids() regarding CLONE_NEWPID). eclone() also attempts to address a second limitation of the clone() system call. clone() is restricted to 32 clone flags and all but one of these are in use. If more new clone flags are needed, we will be forced to define a new variant of the clone() system call. To address this, eclone() allows at least 64 clone flags with some room for more if necessary. To prevent unprivileged processes from misusing this interface, eclone() currently needs CAP_SYS_ADMIN, when the 'pids' parameter is non-NULL. See Documentation/eclone in next patch for more details and an example of its usage. NOTE: - System calls are restricted to 6 parameters and the number and sizes of parameters needed for eclone() exceed 6 integers. The new prototype works around this restriction while providing some flexibility if eclone() needs to be further extended in the future. TODO: - We should convert clone-flags to 64-bit value in all architectures. Its probably best to do that as a separate patchset since clone_flags touches several functions and that patchset seems independent of this new system call. Changelog[v15]: - [Albert Cahalan, Randy Dunlap] Specify stack as [base, size] so user-space code is same for all architectures (rename ->child_stack to ->child_stack_base; compute stack pointer in kernel) - [Russel King]: Specifying '__user' for integer type is redundant. Changelog[v14]: - [Oren Laadan] Rebase to kernel 2.6.33 * introduce PTREGSCALL4 for sys_eclone * consolidate syscall definitions for 32/64 bit - [Oren Laadan] Merge x86_64 (trivial patch) with current - [Serge Hallyn] Add eclone stub for ia32 eclone Changelog[v13]: - [Dave Hansen]: Reorg to enable sharing code between x86 and x86-64. - [Arnd Bergmann]: With args_size parameter, ->reserved1 is redundant and can be removed. - [Nathan Lynch]: stop warnings about assigning u64 to a (32-bit) int*. - [Nathan Lynch, Serge Hallyn] Rename ->child_stack_base to ->child_stack and ensure ->child_stack_size is 0 on architectures that don't need it (see comments in types.h for details). Changelog[v12]: - [Serge Hallyn] Ignore ->child_stack_size if ->child_stack_base is NULL. - [Oren Laadan, Serge Hallyn] Rename clone_with_pids() to eclone() Changelog[v11]: - [Dave Hansen] Move clone_args validation checks to arch-indpeendent code. - [Oren Laadan] Make args_size a parameter to system call and remove it from 'struct clone_args' Changelog[v10]: - Rename clone3() to clone_with_pids() - [Linus Torvalds] Use PTREGSCALL() rather than the generic syscall implementation Changelog[v9]: - [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit architectures split the new clone-flags into 'low' and 'high' words and pass in the 'lower' flags as the first argument. This would maintain similarity of the clone3() with clone()/ clone2(). Also has the side-effect of the name matching the number of parameters :-) - [Roland McGrath] Rename structure to 'clone_args' and add a 'child_stack_size' field Changelog[v8] - [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg' must be 64-bit. - clone2() is in use in IA64. Rename system call to clone3(). Changelog[v7]: - [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2() and group parameters into a new 'struct clone_struct' object. Changelog[v6]: - (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds) Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set' is constant across architectures. - (Nathan Lynch) Change pid_set.num_pids to unsigned and remove 'unum_pids < 0' check. Changelog[v4]: - (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set' Changelog[v3]: - (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid in the target_pids[] list and setting it 0. See copy_target_pids()). - (Oren Laadan) Specified target pids should apply only to youngest pid-namespaces (see copy_target_pids()) - (Matt Helsley) Update patch description. Changelog[v2]: - Remove unnecessary printk and add a note to callers of copy_target_pids() to free target_pids. - (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description. - (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and 'num_pids == 0' (fall back to normal clone()). - Move arch-independent code (sanity checks and copy-in of target-pids) into kernel/fork.c and simplify sys_clone_with_pids() Changelog[v1]: - Fixed some compile errors (had fixed these errors earlier in my git tree but had not refreshed patches before emailing them) Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linuxppc-dev-mnsaURCQ41sdnm+yROfE0A@public.gmane.org Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Tested-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Acked-by: Oren Laadan <orenl.cs.columbia.edu> --- arch/x86/ia32/ia32entry.S | 2 + arch/x86/include/asm/syscalls.h | 2 + arch/x86/include/asm/unistd_32.h | 3 +- arch/x86/include/asm/unistd_64.h | 2 + arch/x86/kernel/entry_32.S | 14 ++++ arch/x86/kernel/entry_64.S | 1 + arch/x86/kernel/process.c | 43 ++++++++++++- arch/x86/kernel/syscall_table_32.S | 1 + include/linux/sched.h | 2 + include/linux/types.h | 10 +++ kernel/fork.c | 124 +++++++++++++++++++++++++++++++++++- 11 files changed, 201 insertions(+), 3 deletions(-) diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S index 59b4556..b7f3f34 100644 --- a/arch/x86/ia32/ia32entry.S +++ b/arch/x86/ia32/ia32entry.S @@ -477,6 +477,7 @@ quiet_ni_syscall: PTREGSCALL stub32_clone, sys32_clone, %rdx PTREGSCALL stub32_vfork, sys_vfork, %rdi PTREGSCALL stub32_iopl, sys_iopl, %rsi + PTREGSCALL stub32_eclone, sys_eclone, %r8 ENTRY(ia32_ptregs_common) popq %r11 @@ -842,4 +843,5 @@ ia32_sys_call_table: .quad compat_sys_rt_tgsigqueueinfo /* 335 */ .quad sys_perf_event_open .quad compat_sys_recvmmsg + .quad stub32_eclone ia32_syscall_end: diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h index 5c044b4..d525677 100644 --- a/arch/x86/include/asm/syscalls.h +++ b/arch/x86/include/asm/syscalls.h @@ -27,6 +27,8 @@ long sys_execve(char __user *, char __user * __user *, char __user * __user *, struct pt_regs *); long sys_clone(unsigned long, unsigned long, void __user *, void __user *, struct pt_regs *); +long sys_eclone(unsigned flags_low, struct clone_args __user *uca, + int args_size, pid_t __user *pids, struct pt_regs *regs); /* kernel/ldt.c */ asmlinkage int sys_modify_ldt(int, void __user *, unsigned long); diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h index beb9b5f..e543b0e 100644 --- a/arch/x86/include/asm/unistd_32.h +++ b/arch/x86/include/asm/unistd_32.h @@ -343,10 +343,11 @@ #define __NR_rt_tgsigqueueinfo 335 #define __NR_perf_event_open 336 #define __NR_recvmmsg 337 +#define __NR_eclone 338 #ifdef __KERNEL__ -#define NR_syscalls 338 +#define NR_syscalls 339 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h index ff4307b..1cd16af 100644 --- a/arch/x86/include/asm/unistd_64.h +++ b/arch/x86/include/asm/unistd_64.h @@ -663,6 +663,8 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo) __SYSCALL(__NR_perf_event_open, sys_perf_event_open) #define __NR_recvmmsg 299 __SYSCALL(__NR_recvmmsg, sys_recvmmsg) +#define __NR_eclone 300 +__SYSCALL(__NR_eclone, stub_eclone) #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index 44a8e0d..65e1735 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -758,6 +758,19 @@ ptregs_##name: \ addl $4,%esp; \ ret +#define PTREGSCALL4(name) \ + ALIGN; \ +ptregs_##name: \ + leal 4(%esp),%eax; \ + pushl %eax; \ + pushl PT_ESI(%eax); \ + movl PT_EDX(%eax),%ecx; \ + movl PT_ECX(%eax),%edx; \ + movl PT_EBX(%eax),%eax; \ + call sys_##name; \ + addl $8,%esp; \ + ret + PTREGSCALL1(iopl) PTREGSCALL0(fork) PTREGSCALL0(vfork) @@ -767,6 +780,7 @@ PTREGSCALL0(sigreturn) PTREGSCALL0(rt_sigreturn) PTREGSCALL2(vm86) PTREGSCALL1(vm86old) +PTREGSCALL4(eclone) /* Clone is an oddball. The 4th arg is in %edi */ ALIGN; diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 0697ff1..216681e 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -698,6 +698,7 @@ END(\label) PTREGSCALL stub_vfork, sys_vfork, %rdi PTREGSCALL stub_sigaltstack, sys_sigaltstack, %rdx PTREGSCALL stub_iopl, sys_iopl, %rsi + PTREGSCALL stub_eclone, sys_eclone, %r8 ENTRY(ptregscall_common) DEFAULT_FRAME 1 8 /* offset 8: return address */ diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 28ad9f4..c905f45 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -259,6 +259,48 @@ sys_clone(unsigned long clone_flags, unsigned long newsp, return do_fork(clone_flags, newsp, regs, 0, parent_tid, child_tid); } +long +sys_eclone(unsigned flags_low, struct clone_args __user *uca, + int args_size, pid_t __user *pids, struct pt_regs *regs) +{ + int rc; + struct clone_args kca; + unsigned long flags; + int __user *parent_tidp; + int __user *child_tidp; + unsigned long stack; + unsigned long stack_size; + unsigned long stack_base; + + rc = fetch_clone_args_from_user(uca, args_size, &kca); + if (rc) + return rc; + + /* + * TODO: Convert 'clone-flags' to 64-bits on all architectures. + * TODO: When ->clone_flags_high is non-zero, copy it in to the + * higher word(s) of 'flags': + * + * flags = (kca.clone_flags_high << 32) | flags_low; + */ + flags = flags_low; + parent_tidp = (int *)(unsigned long)kca.parent_tid_ptr; + child_tidp = (int *)(unsigned long)kca.child_tid_ptr; + stack_base = (unsigned long)kca.child_stack_base; + stack_size = (unsigned long)kca.child_stack_size; + + /* + * Compute stack pointer and adjust for function-arg and return + * address values on the stack. + */ + stack = regs->sp; + if (stack_base) + stack = stack_base + stack_size - 2*sizeof(void *); + + return do_fork_with_pids(flags, stack, regs, 0, parent_tidp, + child_tidp, kca.nr_pids, pids); +} + /* * This gets run with %si containing the * function to call, and %di containing @@ -700,4 +742,3 @@ unsigned long arch_randomize_brk(struct mm_struct *mm) unsigned long range_end = mm->brk + 0x02000000; return randomize_range(mm->brk, range_end, 0) ? : mm->brk; } - diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index 8b37293..0c92570 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -337,3 +337,4 @@ ENTRY(sys_call_table) .long sys_rt_tgsigqueueinfo /* 335 */ .long sys_perf_event_open .long sys_recvmmsg + .long ptregs_eclone diff --git a/include/linux/sched.h b/include/linux/sched.h index f4ae3e3..8593051 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2129,6 +2129,8 @@ extern int disallow_signal(int); extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *); extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *); +extern int fetch_clone_args_from_user(struct clone_args __user *, int, + struct clone_args *); extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, unsigned int, pid_t __user *); diff --git a/include/linux/types.h b/include/linux/types.h index c42724f..25d1c62 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -204,6 +204,16 @@ struct ustat { char f_fpack[6]; }; +struct clone_args { + u64 clone_flags_high; + u64 child_stack_base; + u64 child_stack_size; + u64 parent_tid_ptr; + u64 child_tid_ptr; + u32 nr_pids; + u32 reserved0; +}; + #endif /* __KERNEL__ */ #endif /* __ASSEMBLY__ */ #endif /* _LINUX_TYPES_H */ diff --git a/kernel/fork.c b/kernel/fork.c index 2559d7a..9d5be5c 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1350,6 +1350,114 @@ struct task_struct * __cpuinit fork_idle(int cpu) } /* + * If user specified any 'target-pids' in @upid_setp, copy them from + * user and return a pointer to a local copy of the list of pids. The + * caller must free the list, when they are done using it. + * + * If user did not specify any target pids, return NULL (caller should + * treat this like normal clone). + * + * On any errors, return the error code + */ +static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids) +{ + int j; + int rc; + int size; + int knum_pids; /* # of pids needed in kernel */ + pid_t *target_pids; + + if (!unum_pids) + return NULL; + + knum_pids = task_pid(current)->level + 1; + if (unum_pids > knum_pids) + return ERR_PTR(-EINVAL); + + /* + * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[] + * and set it to 0. This last entry in target_pids[] corresponds to the + * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was + * specified. If CLONE_NEWPID was not specified, this last entry will + * simply be ignored. + */ + target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL); + if (!target_pids) + return ERR_PTR(-ENOMEM); + + /* + * A process running in a level 2 pid namespace has three pid namespaces + * and hence three pid numbers. If this process is checkpointed, + * information about these three namespaces are saved. We refer to these + * namespaces as 'known namespaces'. + * + * If this checkpointed process is however restarted in a level 3 pid + * namespace, the restarted process has an extra ancestor pid namespace + * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'. + * + * During restart, the process requests specific pids for its 'known + * namespaces' and lets kernel assign pids to its 'unknown namespaces'. + * + * Since the requested-pids correspond to 'known namespaces' and since + * 'known-namespaces' are younger than (i.e descendants of) 'unknown- + * namespaces', copy requested pids to the back-end of target_pids[] + * (i.e before the last entry for CLONE_NEWPID mentioned above). + * Any entries in target_pids[] not corresponding to a requested pid + * will be set to zero and kernel assigns a pid in those namespaces. + * + * NOTE: The order of pids in target_pids[] is oldest pid namespace + * to youngest (target_pids[0] corresponds to init_pid_ns). i.e. the + * the order is: + * + * - pids for 'unknown-namespaces' (if any) + * - pids for 'known-namespaces' (requested pids) + * - 0 in the last entry (for CLONE_NEWPID). + */ + j = knum_pids - unum_pids; + size = unum_pids * sizeof(pid_t); + + rc = copy_from_user(&target_pids[j], upids, size); + if (rc) { + rc = -EFAULT; + goto out_free; + } + + return target_pids; + +out_free: + kfree(target_pids); + return ERR_PTR(rc); +} + +int +fetch_clone_args_from_user(struct clone_args __user *uca, int args_size, + struct clone_args *kca) +{ + int rc; + + /* + * TODO: If size of clone_args is not what the kernel expects, it + * could be that kernel is newer and has an extended structure. + * When that happens, this check needs to be smarter. For now, + * assume exact match. + */ + if (args_size != sizeof(struct clone_args)) + return -EINVAL; + + rc = copy_from_user(kca, uca, args_size); + if (rc) + return -EFAULT; + + /* + * To avoid future compatibility issues, ensure unused fields are 0. + */ + if (kca->reserved0 || kca->clone_flags_high) + return -EINVAL; + + return 0; +} + +/* * Ok, this is the main fork-routine. * * It copies the process, and if successful kick-starts @@ -1367,7 +1475,7 @@ long do_fork_with_pids(unsigned long clone_flags, struct task_struct *p; int trace = 0; long nr; - pid_t *target_pids = NULL; + pid_t *target_pids; /* * Do some preliminary argument and permissions checking before we @@ -1401,6 +1509,16 @@ long do_fork_with_pids(unsigned long clone_flags, } } + target_pids = copy_target_pids(num_pids, upids); + if (target_pids) { + if (IS_ERR(target_pids)) + return PTR_ERR(target_pids); + + nr = -EPERM; + if (!capable(CAP_SYS_ADMIN)) + goto out_free; + } + /* * When called from kernel_thread, don't do user tracing stuff. */ @@ -1462,6 +1580,10 @@ long do_fork_with_pids(unsigned long clone_flags, } else { nr = PTR_ERR(p); } + +out_free: + kfree(target_pids); + return nr; } -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 09/11][v15]: Implement sys_eclone for s390 [not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> ` (7 preceding siblings ...) 2010-07-03 20:32 ` [PATCH 08/11][v15]: Implement sys_eclone for x86 (32,64) Sukadev Bhattiprolu @ 2010-07-03 20:32 ` Sukadev Bhattiprolu 2010-07-03 20:32 ` [PATCH 10/11][v15]: Implement sys_eclone for powerpc Sukadev Bhattiprolu 2010-07-03 20:32 ` [PATCH 11/11][v15]: Document sys_eclone Sukadev Bhattiprolu 10 siblings, 0 replies; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-03 20:32 UTC (permalink / raw) To: Oren Laadan, Serge Hallyn Cc: Randy Dunlap, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath From: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Implement the s390 hook for sys_eclone(). Changelog: Jun 23: [Albert Cahalan, Randy Dunlap] Specify stack as [base,offset] [Russel King] Don't specify '__user' for integer types Nov 24: Removed user-space code from commit log. See user-cr git tree. Nov 17: remove redundant flags_high check Nov 13: As suggested by Heiko, convert eclone to take its parameters via registers. Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linuxppc-dev-mnsaURCQ41sdnm+yROfE0A@public.gmane.org Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- arch/s390/include/asm/unistd.h | 3 +- arch/s390/kernel/compat_linux.c | 17 ++++++++++++++++ arch/s390/kernel/compat_wrapper.S | 8 +++++++ arch/s390/kernel/process.c | 39 +++++++++++++++++++++++++++++++++++++ arch/s390/kernel/syscalls.S | 1 + 5 files changed, 67 insertions(+), 1 deletions(-) diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h index 5f00751..ff13be1 100644 --- a/arch/s390/include/asm/unistd.h +++ b/arch/s390/include/asm/unistd.h @@ -269,7 +269,8 @@ #define __NR_pwritev 329 #define __NR_rt_tgsigqueueinfo 330 #define __NR_perf_event_open 331 -#define NR_syscalls 332 +#define __NR_eclone 332 +#define NR_syscalls 333 /* * There are some system calls that are not present on 64 bit, some diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c index 73b624e..1f70d6f 100644 --- a/arch/s390/kernel/compat_linux.c +++ b/arch/s390/kernel/compat_linux.c @@ -663,6 +663,23 @@ asmlinkage long sys32_write(unsigned int fd, char __user * buf, size_t count) return sys_write(fd, buf, count); } +asmlinkage long sys32_clone(void) +{ + struct pt_regs *regs = task_pt_regs(current); + unsigned long clone_flags; + unsigned long newsp; + int __user *parent_tidptr, *child_tidptr; + + clone_flags = regs->gprs[3] & 0xffffffffUL; + newsp = regs->orig_gpr2 & 0x7fffffffUL; + parent_tidptr = compat_ptr(regs->gprs[4]); + child_tidptr = compat_ptr(regs->gprs[5]); + if (!newsp) + newsp = regs->gprs[15]; + return do_fork(clone_flags, newsp, regs, 0, + parent_tidptr, child_tidptr); +} + /* * 31 bit emulation wrapper functions for sys_fadvise64/fadvise64_64. * These need to rewrite the advise values for POSIX_FADV_{DONTNEED,NOREUSE} diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S index 672ce52..b7bedfa 100644 --- a/arch/s390/kernel/compat_wrapper.S +++ b/arch/s390/kernel/compat_wrapper.S @@ -1847,6 +1847,14 @@ sys_clone_wrapper: llgtr %r5,%r5 # int * jg sys_clone # branch to system call + .globl sys_eclone_wrapper +sys_eclone_wrapper: + llgfr %r2,%r2 # unsigned int + llgtr %r3,%r3 # struct clone_args * + lgfr %r4,%r4 # int + llgtr %r5,%r5 # pid_t * + jg sys_eclone # branch to system call + .globl sys32_execve_wrapper sys32_execve_wrapper: llgtr %r2,%r2 # char * diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c index 1039fde..e511122 100644 --- a/arch/s390/kernel/process.c +++ b/arch/s390/kernel/process.c @@ -240,6 +240,45 @@ SYSCALL_DEFINE4(clone, unsigned long, newsp, unsigned long, clone_flags, parent_tidptr, child_tidptr); } +SYSCALL_DEFINE4(eclone, unsigned int, flags_low, struct clone_args __user *, + uca, int, args_size, pid_t __user *, pids) +{ + int rc; + struct pt_regs *regs = task_pt_regs(current); + struct clone_args kca; + int __user *parent_tid_ptr; + int __user *child_tid_ptr; + unsigned long flags; + unsigned long stack; + unsigned long stack_base; + unsigned long stack_size; + + rc = fetch_clone_args_from_user(uca, args_size, &kca); + if (rc) + return rc; + + flags = flags_low; + parent_tid_ptr = (int __user *) kca.parent_tid_ptr; + child_tid_ptr = (int __user *) kca.child_tid_ptr; + stack_base = (unsigned long) kca.child_stack_base; + stack_size = (unsigned long) kca.child_stack_size; + + /* + * Compute stack pointer and adjust for function-arg and return + * adderss values on the stack. + */ + stack = regs->gprs[15]; + if (!stack) + stack = stack_base + stack_size - 2*sizeof(void *); + + /* + * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value + * to several functions. Need to convert clone_flags to 64-bit. + */ + return do_fork_with_pids(flags, stack, regs, 0, parent_tid_ptr, + child_tid_ptr, kca.nr_pids, pids); +} + /* * This is trivial, and on the face of it looks like it * could equally well be done in user mode. diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S index 201ce6b..08eab1d 100644 --- a/arch/s390/kernel/syscalls.S +++ b/arch/s390/kernel/syscalls.S @@ -340,3 +340,4 @@ SYSCALL(sys_preadv,sys_preadv,compat_sys_preadv_wrapper) SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper) SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo_wrapper) /* 330 */ SYSCALL(sys_perf_event_open,sys_perf_event_open,sys_perf_event_open_wrapper) +SYSCALL(sys_eclone,sys_eclone,sys_eclone_wrapper) -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 10/11][v15]: Implement sys_eclone for powerpc [not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> ` (8 preceding siblings ...) 2010-07-03 20:32 ` [PATCH 09/11][v15]: Implement sys_eclone for s390 Sukadev Bhattiprolu @ 2010-07-03 20:32 ` Sukadev Bhattiprolu 2010-07-03 20:32 ` [PATCH 11/11][v15]: Document sys_eclone Sukadev Bhattiprolu 10 siblings, 0 replies; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-03 20:32 UTC (permalink / raw) To: Oren Laadan, Serge Hallyn Cc: Randy Dunlap, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath From: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org> Wired up for both ppc32 and ppc64, but tested only with the latter. Changelog: - Jun 21: (Sukadev): Specify stack as [base, offset]. - Jan 20: (ntl) fix 32-bit build - Nov 17: (serge) remove redundant flags_high check, and don't fold it into flags. Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linuxppc-dev-mnsaURCQ41sdnm+yROfE0A@public.gmane.org Signed-off-by: Nathan Lynch <ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org> Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> --- arch/powerpc/include/asm/syscalls.h | 6 +++ arch/powerpc/include/asm/systbl.h | 1 + arch/powerpc/include/asm/unistd.h | 3 +- arch/powerpc/kernel/entry_32.S | 8 ++++ arch/powerpc/kernel/entry_64.S | 5 +++ arch/powerpc/kernel/process.c | 62 ++++++++++++++++++++++++++++++++++- 6 files changed, 83 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h index 4084e56..920cefd 100644 --- a/arch/powerpc/include/asm/syscalls.h +++ b/arch/powerpc/include/asm/syscalls.h @@ -23,6 +23,12 @@ asmlinkage int sys_execve(unsigned long a0, unsigned long a1, asmlinkage int sys_clone(unsigned long clone_flags, unsigned long usp, int __user *parent_tidp, void __user *child_threadptr, int __user *child_tidp, int p6, struct pt_regs *regs); +asmlinkage int sys_eclone(unsigned long flags_low, + struct clone_args __user *args, + size_t args_size, + pid_t __user *pids, + unsigned long p5, unsigned long p6, + struct pt_regs *regs); asmlinkage int sys_fork(unsigned long p1, unsigned long p2, unsigned long p3, unsigned long p4, unsigned long p5, unsigned long p6, struct pt_regs *regs); diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index a5ee345..f94fc43 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -326,3 +326,4 @@ SYSCALL_SPU(perf_event_open) COMPAT_SYS_SPU(preadv) COMPAT_SYS_SPU(pwritev) COMPAT_SYS(rt_tgsigqueueinfo) +PPC_SYS(eclone) diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h index f0a1026..4cdbd5c 100644 --- a/arch/powerpc/include/asm/unistd.h +++ b/arch/powerpc/include/asm/unistd.h @@ -345,10 +345,11 @@ #define __NR_preadv 320 #define __NR_pwritev 321 #define __NR_rt_tgsigqueueinfo 322 +#define __NR_eclone 323 #ifdef __KERNEL__ -#define __NR_syscalls 323 +#define __NR_syscalls 324 #define __NR__exit __NR_exit #define NR_syscalls __NR_syscalls diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S index 1175a85..579f1da 100644 --- a/arch/powerpc/kernel/entry_32.S +++ b/arch/powerpc/kernel/entry_32.S @@ -586,6 +586,14 @@ ppc_clone: stw r0,_TRAP(r1) /* register set saved */ b sys_clone + .globl ppc_eclone +ppc_eclone: + SAVE_NVGPRS(r1) + lwz r0,_TRAP(r1) + rlwinm r0,r0,0,0,30 /* clear LSB to indicate full */ + stw r0,_TRAP(r1) /* register set saved */ + b sys_eclone + .globl ppc_swapcontext ppc_swapcontext: SAVE_NVGPRS(r1) diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S index 07109d8..b763340 100644 --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -344,6 +344,11 @@ _GLOBAL(ppc_clone) bl .sys_clone b syscall_exit +_GLOBAL(ppc_eclone) + bl .save_nvgprs + bl .sys_eclone + b syscall_exit + _GLOBAL(ppc32_swapcontext) bl .save_nvgprs bl .compat_sys_swapcontext diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index e4d71ce..f12a805 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -961,7 +961,67 @@ int sys_clone(unsigned long clone_flags, unsigned long usp, child_tidp = TRUNC_PTR(child_tidp); } #endif - return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp); + return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp); +} + +#ifdef CONFIG_PPC64 +#define MIN_STACK_FRAME 48 +#else +#define MIN_STACK_FRAME 16 +#endif + +static inline unsigned long get_user_stack_pointer(unsigned long base, + unsigned long size) +{ + return ((base + size - 1) & ~0xF) - MIN_STACK_FRAME; +} + +int sys_eclone(unsigned long clone_flags_low, + struct clone_args __user *uclone_args, + size_t size, + pid_t __user *upids, + unsigned long p5, unsigned long p6, + struct pt_regs *regs) +{ + struct clone_args kclone_args; + unsigned long stack_base; + int __user *parent_tidp; + int __user *child_tidp; + unsigned long stack_sz; + unsigned int nr_pids; + unsigned long flags; + unsigned long usp; + int rc; + + CHECK_FULL_REGS(regs); + + rc = fetch_clone_args_from_user(uclone_args, size, &kclone_args); + if (rc) + return rc; + + stack_sz = kclone_args.child_stack_size; + stack_base = kclone_args.child_stack_base; + + /* Interpret stack_base as the child sp if it is set. */ + usp = regs->gpr[1]; + if (stack_base) + usp = get_user_stack_pointer(stack_base, stack_sz); + + flags = clone_flags_low; + + nr_pids = kclone_args.nr_pids; + + parent_tidp = (int __user *)(unsigned long)kclone_args.parent_tid_ptr; + child_tidp = (int __user *)(unsigned long)kclone_args.child_tid_ptr; + +#ifdef CONFIG_PPC64 + if (test_thread_flag(TIF_32BIT)) { + parent_tidp = TRUNC_PTR(parent_tidp); + child_tidp = TRUNC_PTR(child_tidp); + } +#endif + return do_fork_with_pids(flags, usp, regs, 0UL, parent_tidp, + child_tidp, nr_pids, upids); } int sys_fork(unsigned long p1, unsigned long p2, unsigned long p3, -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH 11/11][v15]: Document sys_eclone [not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> ` (9 preceding siblings ...) 2010-07-03 20:32 ` [PATCH 10/11][v15]: Implement sys_eclone for powerpc Sukadev Bhattiprolu @ 2010-07-03 20:32 ` Sukadev Bhattiprolu [not found] ` <1278189164-28408-12-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> 10 siblings, 1 reply; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-03 20:32 UTC (permalink / raw) To: Oren Laadan, Serge Hallyn Cc: Randy Dunlap, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath This gives a brief overview of the eclone() system call. We should eventually describe more details in existing clone(2) man page or in a new man page. Changelog[v15]: - [Albert Cahalan, Randy Dunlap]: Specify stack as [base,offset] on all architectures. - [Randy Dunlap]: Fix typos and add a pointer to user-cr for usage examples of eclone() on architectures besides x86. Changelog[v13]: - [Nathan Lynch, Serge Hallyn] Rename ->child_stack_base to ->child_stack and ensure ->child_stack_size is 0 on architectures that don't need it. - [Arnd Bergmann] Remove ->reserved1 field - [Louis Rilling, Dave Hansen] Combine the two asm statements in the example into one and use memory constraint to avoid unncessary copies. Changelog[v12]: - [Serge Hallyn] Fix/simplify stack-setup in the example code - [Serge Hallyn, Oren Laadan] Rename syscall to eclone() Changelog[v11]: - [Dave Hansen] Move clone_args validation checks to arch-indpendent code. - [Oren Laadan] Make args_size a parameter to system call and remove it from 'struct clone_args' - [Oren Laadan] Fix some typos and clarify the order of pids in the @pids parameter. Changelog[v10]: - Rename clone3() to clone_with_pids() and fix some typos. - Modify example to show usage with the ptregs implementation. Changelog[v9]: - [Pavel Machek]: Fix an inconsistency and rename new file to Documentation/clone3. - [Roland McGrath, H. Peter Anvin] Updates to description and example to reflect new prototype of clone3() and the updated/ renamed 'struct clone_args'. Changelog[v8]: - clone2() is already in use in IA64. Rename syscall to clone3() - Add notes to say that we return -EINVAL if invalid clone flags are specified or if the reserved fields are not 0. Changelog[v7]: - Rename clone_with_pids() to clone2() - Changes to reflect new prototype of clone2() (using clone_struct). Cc: linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: x86-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org Cc: linux-s390-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: linuxppc-dev-mnsaURCQ41sdnm+yROfE0A@public.gmane.org Signed-off-by: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> Acked-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> Acked-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- Documentation/eclone | 354 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 354 insertions(+), 0 deletions(-) create mode 100644 Documentation/eclone diff --git a/Documentation/eclone b/Documentation/eclone new file mode 100644 index 0000000..25db2f5 --- /dev/null +++ b/Documentation/eclone @@ -0,0 +1,354 @@ + +struct clone_args { + u64 clone_flags_high; + u64 child_stack_base; + u64 child_stack_size; + u64 parent_tid_ptr; + u64 child_tid_ptr; + u32 nr_pids; + u32 reserved0; +}; + + +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size, + pid_t * __user pids) + + In addition to doing everything that the clone() system call does, the + eclone() system call: + + - allows additional clone flags (31 of 32 bits in the flags + parameter to clone() are in use) + + - allows user to specify a pid for the child process in its + active and ancestor pid namespaces. + + This system call is meant to be used when restarting an application + from a checkpoint. Such restart requires that the processes in the + application have the same pids they had when the application was + checkpointed. When containers are nested, the processes within the + containers exist in multiple pid namespaces and hence have multiple + pids to specify during restart. + + The @flags_low parameter is identical to the 'clone_flags' parameter + in the existing clone() system call. + + The fields in 'struct clone_args' are meant to be used as follows: + + u64 clone_flags_high: + + When eclone() supports more than 32 flags, the additional bits + in the clone_flags should be specified in this field. This + field is currently unused and must be set to 0. + + u64 child_stack_base; + u64 child_stack_size; + + These two fields describe the region to be used for the child + stack. + + Note that this specification of the stack is similar to the + clone2() system call (used on IA64) but different from clone() + system call. The clone() system call specifies the stack pointer + as a parameter, but eclone() and clone2() describe stack as + [base, offset]. + + u64 parent_tid_ptr; + u64 child_tid_ptr; + + These two fields correspond to the 'parent_tid_ptr' and + 'child_tid_ptr' fields in the clone() system call. + + u32 nr_pids; + + nr_pids specifies the number of pids in the @pids array + parameter to eclone() (see below). nr_pids should not exceed + the current nesting level of the calling process (i.e. if the + process is in init_pid_ns, nr_pids must be 1, if process is + in a pid namespace that is a child of init-pid-ns, nr_pids + cannot exceed 2, and so on). + + u32 reserved0; + + This field is intended to extend the functionality of the + eclone() in the future, while preserving backward compatibility. + It must be set to 0 for now. + + The @cargs_size parameter specifes the sizeof(struct clone_args) and + is intended to enable extending this structure in the future, while + preserving backward compatibility. For now, this field must be set + to the sizeof(struct clone_args) and this size must match the kernel's + view of the structure. + + The @pids parameter defines the set of pids that should be assigned to + the child process in its active and ancestor pid namespaces. The + descendant pid namespaces do not matter since a process does not have a + pid in descendant namespaces, unless the process is in a new pid + namespace in which case the process is a container-init (and must have + the pid 1 in that namespace). + + See CLONE_NEWPID section of the clone(2) man page for details about pid + namespaces. + + If a pid in the @pids list is 0, the kernel will assign the next + available pid in the pid namespace. + + If a pid in the @pids list is non-zero, the kernel tries to assign + the specified pid in that namespace. If that pid is already in use + by another process, the system call fails (see EBUSY below). + + The order of pids in @pids is oldest in pids[0] to youngest pid + namespace in pids[nr_pids-1]. If the number of pids specified in the + @pids list is fewer than the nesting level of the process, the pids + are applied from youngest namespace. I.e if the process is nested in + a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids + are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to + have a pid of '0' (the kernel will assign a pid in those namespaces). + + On success, the system call returns the pid of the child process in + the parent's active pid namespace. + + On failure, eclone() returns -1 and sets 'errno' to one of following + values (the child process is not created). + + EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to + specify the pids in this call (if pids are not specifed + CAP_SYS_ADMIN is not required). + + EINVAL The number of pids specified in 'clone_args.nr_pids' exceeds + the current nesting level of parent process. + + EINVAL Not all specified clone-flags are valid. + + EINVAL The reserved fields in the clone_args argument are not 0. + + EINVAL The child_stack_size field is not 0 (on architectures that + pass in a stack pointer in ->child_stack field). + + EBUSY A requested pid is in use by another process in that namespace. + +Following shows an example usage of eclone() on x86. To build/use eclone() with +other supported architectures (x86_64, ppc and s390), see the clone*[hcS] files +in the following git-tree. + + git://git.ncl.cs.columbia.edu/pub/git/user-cr.git + +The Makefile in the top-level directory builds a 'libeclone.a' which implements +the eclone() interface for the appropriate architecture. + +--- +/* + * Example eclone() usage - Create a child process with pid CHILD_TID1 in + * the current pid namespace. The child gets the usual "random" pid in any + * ancestor pid namespaces. + */ +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <signal.h> +#include <errno.h> +#include <unistd.h> +#include <wait.h> +#include <sys/syscall.h> + +#define __NR_eclone 338 +#define CLONE_NEWPID 0x20000000 +#define CLONE_CHILD_SETTID 0x01000000 +#define CLONE_PARENT_SETTID 0x00100000 +#define CLONE_UNUSED 0x00001000 + +#define STACKSIZE 8192 + +typedef unsigned long long u64; +typedef unsigned int u32; +typedef int pid_t; +struct clone_args { + u64 clone_flags_high; + u64 child_stack_base; + u64 child_stack_size; + + u64 parent_tid_ptr; + u64 child_tid_ptr; + + u32 nr_pids; + + u32 reserved0; +}; + +#define exit _exit + +/* + * Following eclone() is based on code posted by Oren Laadan at: + * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html + */ +#if defined(__i386__) && defined(__NR_eclone) + +int eclone(u32 flags_low, struct clone_args *clone_args, int args_size, + int *pids) +{ + long retval; + + __asm__ __volatile__( + "movl %3, %%ebx\n\t" /* flags_low -> 1st (ebx) */ + "movl %4, %%ecx\n\t" /* clone_args -> 2nd (ecx)*/ + "movl %5, %%edx\n\t" /* args_size -> 3rd (edx) */ + "movl %6, %%esi\n\t" /* pids -> 4th (esi)*/ + + "pushl %%ebp\n\t" /* save value of ebp */ + "int $0x80\n\t" /* Linux/i386 system call */ + "testl %0,%0\n\t" /* check return value */ + "jne 1f\n\t" /* jump if parent */ + + "popl %%esi\n\t" /* get subthread function */ + "call *%%esi\n\t" /* start subthread function */ + "movl %2,%0\n\t" + "int $0x80\n" /* exit system call: exit subthread */ + "1:\n\t" + "popl %%ebp\t" /* restore parent's ebp */ + + :"=a" (retval) + + :"0" (__NR_eclone), + "i" (__NR_exit), + "m" (flags_low), + "m" (clone_args), + "m" (args_size), + "m" (pids) + ); + + if (retval < 0) { + errno = -retval; + retval = -1; + } + return retval; +} + +/* + * Allocate a stack for the clone-child and arrange to have the child + * execute @child_fn with @child_arg as the argument. + */ +void *setup_stack(int (*child_fn)(void *), void *child_arg, int size) +{ + int rc; + void *stack_base; + void **stack; + + rc = posix_memalign(&stack_base, getpagesize(), size); + if (rc) { + perror("posix_memalign()"); + exit(1); + } + + stack = (void **)((unsigned long)stack_base + size); + *--stack = child_arg; + *--stack = child_fn; + + return stack_base; +} +#endif + +/* gettid() is a bit more useful than getpid() when messing with clone() */ +int gettid() +{ + int rc; + + rc = syscall(__NR_gettid, 0, 0, 0); + if (rc < 0) { + printf("rc %d, errno %d\n", rc, errno); + exit(1); + } + return rc; +} + +#define CHILD_TID1 377 +#define CHILD_TID2 1177 +#define CHILD_TID3 2799 + +struct clone_args clone_args; +void *child_arg = &clone_args; +int child_tid; + +int do_child(void *arg) +{ + struct clone_args *cs = (struct clone_args *)arg; + int ctid; + + /* Verify we pushed the arguments correctly on the stack... */ + if (arg != child_arg) { + printf("Child: Incorrect child arg pointer, expected %p," + "actual %p\n", child_arg, arg); + exit(1); + } + + /* ... and that we got the thread-id we expected */ + ctid = *((int *)(unsigned long)cs->child_tid_ptr); + if (ctid != CHILD_TID1) { + printf("Child: Incorrect child tid, expected %d, actual %d\n", + CHILD_TID1, ctid); + exit(1); + } else { + printf("Child got the expected tid, %d\n", gettid()); + } + sleep(2); + + printf("[%d, %d]: Child exiting\n", getpid(), ctid); + exit(0); +} + +static int do_clone(int (*child_fn)(void *), void *child_arg, + unsigned int flags_low, int nr_pids, pid_t *pids_list) +{ + int rc; + void *stack_base; + struct clone_args *ca = &clone_args; + int args_size; + + stack_base = setup_stack(child_fn, child_arg, STACKSIZE); + + memset(ca, 0, sizeof(*ca)); + + ca->child_stack_base = (unsigned long)stack_base; + ca->child_stack_size = (u64)STACKSIZE; + ca->child_tid_ptr = (unsigned long)&child_tid; + ca->nr_pids = nr_pids; + + args_size = sizeof(struct clone_args); + rc = eclone(flags_low, ca, args_size, pids_list); + + printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(), + rc, errno); + return rc; +} + +/* + * Multiple pid_t pid_t values in pids_list[] here are just for illustration. + * The test case creates a child in the current pid namespace and uses only + * the first value, CHILD_TID1. + */ +pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 }; +int main() +{ + int rc, pid, status; + unsigned long flags; + int nr_pids = 1; + + flags = SIGCHLD|CLONE_CHILD_SETTID; + + pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list); + + printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid); + + rc = waitpid(pid, &status, __WALL); + if (rc < 0) { + printf("waitpid(): rc %d, error %d\n", rc, errno); + } else { + printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(), + gettid(), rc, status); + + if (WIFEXITED(status)) { + printf("\t EXITED, %d\n", WEXITSTATUS(status)); + } else if (WIFSIGNALED(status)) { + printf("\t SIGNALED, %d\n", WTERMSIG(status)); + } + } + return 0; +} -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 31+ messages in thread
[parent not found: <1278189164-28408-12-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <1278189164-28408-12-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> @ 2010-07-03 23:41 ` Albert Cahalan [not found] ` <AANLkTinM1jqG-9Mgbzft8bALGri7ZpzU9ZcPbMTe4fvW-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Albert Cahalan @ 2010-07-03 23:41 UTC (permalink / raw) To: Sukadev Bhattiprolu Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > +struct clone_args { > + u64 clone_flags_high; > + u64 child_stack_base; > + u64 child_stack_size; > + u64 parent_tid_ptr; > + u64 child_tid_ptr; > + u32 nr_pids; > + u32 reserved0; > +}; > + > + > +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size, > + pid_t * __user pids) I don't see why cargs_size is needed for expansion if you have flags. > + The order of pids in @pids is oldest in pids[0] to youngest pid > + namespace in pids[nr_pids-1]. If the number of pids specified in the > + @pids list is fewer than the nesting level of the process, the pids > + are applied from youngest namespace. I.e if the process is nested in > + a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids > + are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to > + have a pid of '0' (the kernel will assign a pid in those namespaces). That feels backwards. I'd have guessed pids[0] is how the process sees itself. You'd truncate the array to reduce nesting level rather than pointing into it. > + On failure, eclone() returns -1 and sets 'errno' to one of following > + values (the child process is not created). Careful here: do you intend to document the system call itself, or an expected glibc wrapper that doesn't exist yet? > + EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to > + specify the pids in this call (if pids are not specifed > + CAP_SYS_ADMIN is not required). It seems appropriate to let PID 1 in any PID namespace be able to assign PIDs in it's own namespace and in any child namespaces. > + EINVAL The child_stack_size field is not 0 (on architectures that > + pass in a stack pointer in ->child_stack field). need to change this > + "int $0x80\n\t" /* Linux/i386 system call */ > + "testl %0,%0\n\t" /* check return value */ > + "jne 1f\n\t" /* jump if parent */ > + > + "popl %%esi\n\t" /* get subthread function */ > + "call *%%esi\n\t" /* start subthread function */ > + "movl %2,%0\n\t" > + "int $0x80\n" /* exit system call: exit subthread */ ... > +/* > + * Allocate a stack for the clone-child and arrange to have the child > + * execute @child_fn with @child_arg as the argument. > + */ ... > + *--stack = child_arg; > + *--stack = child_fn; ... > +static int do_clone(int (*child_fn)(void *), void *child_arg, > + unsigned int flags_low, int nr_pids, pid_t *pids_list) There needs to be a way to pass child_fn and child_arg via the kernel. Besides being required for kernel-managed stacks, it's normally a saner interface. Stack setup would be much like the stack setup for signal handlers. Imagine using this for a vfork-like interface that didn't have painful interactions with the compiler. Speaking of vfork.... 1. can you implement it for i386 (register starved) using eclone? 2. can you restart a pair of processes between vfork and execve? ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <AANLkTinM1jqG-9Mgbzft8bALGri7ZpzU9ZcPbMTe4fvW-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <AANLkTinM1jqG-9Mgbzft8bALGri7ZpzU9ZcPbMTe4fvW-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-07-04 23:39 ` Matt Helsley [not found] ` <20100704233951.GK3338-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Matt Helsley @ 2010-07-04 23:39 UTC (permalink / raw) To: Albert Cahalan Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Sukadev Bhattiprolu, Roland McGrath On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote: > On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu > <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > > > +struct clone_args { > > + u64 clone_flags_high; > > + u64 child_stack_base; > > + u64 child_stack_size; > > + u64 parent_tid_ptr; > > + u64 child_tid_ptr; > > + u32 nr_pids; > > + u32 reserved0; > > +}; > > + > > + > > +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size, > > + pid_t * __user pids) > > I don't see why cargs_size is needed for expansion if you have flags. I think it's cleaner this way. The alternative you seem to be hinting at is: If we used a flag bit to indicate an expansion of the parameters then it would only be able to specify one expansion before we'd have to start using bits in the args structure itself. Using those extra bits is quite gross -- we'd have to copy the initial portion of the struct, decode the bit(s) describing the size, and then copy the rest. Also, do we have any bits left in flags_low? I thought those were all used up... Or perhaps I wasn't able to anticipate the details of your suggestion and you had something else in mind? The way Suka has it we just directly encode the size of struct clone_args as a parameter and get it over with. > > > + The order of pids in @pids is oldest in pids[0] to youngest pid > > + namespace in pids[nr_pids-1]. If the number of pids specified in the > > + @pids list is fewer than the nesting level of the process, the pids > > + are applied from youngest namespace. I.e if the process is nested in > > + a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids > > + are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to > > + have a pid of '0' (the kernel will assign a pid in those namespaces). > > That feels backwards. I'd have guessed pids[0] is how the > process sees itself. You'd truncate the array to reduce nesting > level rather than pointing into it. The only difference between what you ask for and what eclone does is the order of the pids. The array is truncatable as you requested. Knowing nr_pids is sufficient -- there's no need for something "pointing into it". ISTR it was ordered this way to avoid odd-looking loops or extra copying in the kernel pid code. Suka may recall the reasoning better than I do -- I'd have to dig through list archives to be certain. > > > + On failure, eclone() returns -1 and sets 'errno' to one of following > > + values (the child process is not created). > > Careful here: do you intend to document the system call itself, > or an expected glibc wrapper that doesn't exist yet? > > > + EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to > > + specify the pids in this call (if pids are not specifed > > + CAP_SYS_ADMIN is not required). > > It seems appropriate to let PID 1 in any PID namespace be > able to assign PIDs in it's own namespace and in any > child namespaces. I disagree. The way you describe it, more than one "pid 1" could be involved thus the pid assignment could conflict. Especially in the case of container checkpoint/restart where one or more of those "pid 1" tasks is not aware that it's being restarted. It gets even worse if you don't assume that the same container software is being used in nested containers. > > > + EINVAL The child_stack_size field is not 0 (on architectures that > > + pass in a stack pointer in ->child_stack field). > > need to change this > > > + "int $0x80\n\t" /* Linux/i386 system call */ > > + "testl %0,%0\n\t" /* check return value */ > > + "jne 1f\n\t" /* jump if parent */ > > + > > + "popl %%esi\n\t" /* get subthread function */ > > + "call *%%esi\n\t" /* start subthread function */ > > + "movl %2,%0\n\t" > > + "int $0x80\n" /* exit system call: exit subthread */ > ... > > +/* > > + * Allocate a stack for the clone-child and arrange to have the child > > + * execute @child_fn with @child_arg as the argument. > > + */ > ... > > + *--stack = child_arg; > > + *--stack = child_fn; > ... > > +static int do_clone(int (*child_fn)(void *), void *child_arg, > > + unsigned int flags_low, int nr_pids, pid_t *pids_list) > > There needs to be a way to pass child_fn and child_arg > via the kernel. Besides being required for kernel-managed > stacks, it's normally a saner interface. Stack setup would > be much like the stack setup for signal handlers. Imagine I'm inclined to say this is a bad idea. I didn't think we had "kernel-managed stacks" in mainline. The most we have, to my knowledge, is the sigaltstack support and kernel threads. I don't see how being able to pass in child_fn and child_arg to the kernel improves the sanity of the interface. If anything it will make eclone even more exotic -- now at the end of the syscall we'll need to mess with the registers/stack of the child much like when we're invoking a signal handler. That just adds more arch-specific code than is necessary. Userspace wrappers are perfectly capable of invoking the child function and passing the arguments. Furthermore, passing those arguments requires expanding the argument structure or putting even greater pressure on registers (which, as you pointed out below, is an issue for vfork). > using this for a vfork-like interface that didn't have painful > interactions with the compiler. > > Speaking of vfork.... > > 1. can you implement it for i386 (register starved) using eclone? That's a very good question. I'm going to punt on a direct answer for now. Instead, I wonder if it's even worth enabling vfork through eclone. vfork is rarely used, is supported by the "old" clone syscall, and any old code adapted to use eclone for vfork would need significant changes because of vfork's specialness. (A consequence of the way vfork borrows page tables and must avoid clobbering parent's registers..) So while vfork support in eclone might be "nice" to have I don't think it's strictly necessary. > > 2. can you restart a pair of processes between vfork and execve? Another good question. No, we do not support that. In fact, the man page suggests vfork children are not supposed to make syscalls other than execve, kill(SIGABRT), or "_exit". To me that clearly justifies not being able to call restart from a vfork child. Cheers, -Matt Helsley ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <20100704233951.GK3338-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <20100704233951.GK3338-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org> @ 2010-07-05 0:45 ` Albert Cahalan [not found] ` <AANLkTilTGvFTkjy8vi8N8msB7koEp0r7SnpPqJkVN4XA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2010-07-05 4:10 ` H. Peter Anvin 2010-07-05 4:18 ` Oren Laadan 2 siblings, 1 reply; 31+ messages in thread From: Albert Cahalan @ 2010-07-05 0:45 UTC (permalink / raw) To: Matt Helsley Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Sukadev Bhattiprolu, Roland McGrath On Sun, Jul 4, 2010 at 7:39 PM, Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote: >> On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu >> <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: >> >> > +struct clone_args { >> > + u64 clone_flags_high; >> > + u64 child_stack_base; >> > + u64 child_stack_size; >> > + u64 parent_tid_ptr; >> > + u64 child_tid_ptr; >> > + u32 nr_pids; >> > + u32 reserved0; >> > +}; >> > + >> > + >> > +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size, >> > + pid_t * __user pids) >> >> I don't see why cargs_size is needed for expansion if you have flags. > > I think it's cleaner this way. The alternative you seem to be hinting at > is: > > If we used a flag bit to indicate an expansion of the parameters then it > would only be able to specify one expansion before we'd have to start > using bits in the args structure itself. Using those extra bits is > quite gross -- we'd have to copy the initial portion of the struct, decode > the bit(s) describing the size, and then copy the rest. Also, do we have > any bits left in flags_low? I thought those were all used up... You'd be copying from a struct in userspace to some random local variables in the kernel. There is no reason why the kernel would have to use a struct at all. You copy the flags, then see what else you need to copy. FYI, interfaces with struct size parameters have a long history of being rejected. Unless policy has changed or you've been granted a special exemption, this isn't going to go well for you. >> > + EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to >> > + specify the pids in this call (if pids are not specifed >> > + CAP_SYS_ADMIN is not required). >> >> It seems appropriate to let PID 1 in any PID namespace be >> able to assign PIDs in it's own namespace and in any >> child namespaces. > > I disagree. The way you describe it, more than one "pid 1" could be > involved thus the pid assignment could conflict. Especially in the case > of container checkpoint/restart where one or more of those "pid 1" tasks > is not aware that it's being restarted. It gets even worse if you don't > assume that the same container software is being used in nested > containers. I agree that stuff may break, but that would be user error. I don't see why the kernel should be protecting tasks from being broken by strange restart situations. >> There needs to be a way to pass child_fn and child_arg >> via the kernel. Besides being required for kernel-managed >> stacks, it's normally a saner interface. Stack setup would >> be much like the stack setup for signal handlers. Imagine > > I'm inclined to say this is a bad idea. > > I didn't think we had "kernel-managed stacks" in mainline. The most we > have, to my knowledge, is the sigaltstack support and kernel threads. Right, I hope to add it if you don't. Right now a thread is unable to properly exit without help from some other thread unless the programmer dares to run without a stack. FYI, I've done just that, in the cloninator.c test code. Original http://www.mail-archive.com/linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg105341.html http://lkml.org/lkml/2006/12/18/312 Revised http://www.mail-archive.com/linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org/msg106130.html http://lkml.org/lkml/2006/12/19/335 > I don't see how being able to pass in child_fn and child_arg to the > kernel improves the sanity of the interface. If anything it will make > eclone even more exotic -- now at the end of the syscall we'll > need to mess with the registers/stack of the child much like when we're > invoking a signal handler. That just adds more arch-specific code than is > necessary. > > Userspace wrappers are perfectly capable of invoking the child function > and passing the arguments. I'd rather that than have arch-specific code in random programs trying to use the interface. Imagine if setting up a signal handler required userspace code to have a pile of arch-specific stack handling code. That'd suck. >> Speaking of vfork.... >> >> 1. can you implement it for i386 (register starved) using eclone? > > That's a very good question. I'm going to punt on a direct answer for > now. Instead, I wonder if it's even worth enabling vfork through eclone. > vfork is rarely used, is supported by the "old" clone syscall, and any > old code adapted to use eclone for vfork would need significant > changes because of vfork's specialness. (A consequence of the way vfork > borrows page tables and must avoid clobbering parent's registers..) > > So while vfork support in eclone might be "nice" to have I don't think > it's strictly necessary. > >> 2. can you restart a pair of processes between vfork and execve? > > Another good question. > > No, we do not support that. In fact, the man page suggests vfork children > are not supposed to make syscalls other than execve, kill(SIGABRT), or > "_exit". To me that clearly justifies not being able to call restart from > a vfork child. You can safely ignore the man page, which was clearly written by a rather prejudiced author. People use vfork. For example, GNU make uses vfork. It can be way faster than fork for processes with large page tables. People rely on vfork to have the traditional behavior, including the ability to call things like dup2 and open. People expect the traditional parent suspension, and they expect the address space (but nothing else) to be shared. In other words, plain fork is not an acceptable substitute. ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <AANLkTilTGvFTkjy8vi8N8msB7koEp0r7SnpPqJkVN4XA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <AANLkTilTGvFTkjy8vi8N8msB7koEp0r7SnpPqJkVN4XA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-07-05 10:34 ` Arnd Bergmann [not found] ` <201007051234.40943.arnd-r2nGTMty4D4@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Arnd Bergmann @ 2010-07-05 10:34 UTC (permalink / raw) To: Albert Cahalan Cc: Randy Dunlap, Serge Hallyn, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Sukadev Bhattiprolu, Roland McGrath On Monday 05 July 2010, Albert Cahalan wrote: > On Sun, Jul 4, 2010 at 7:39 PM, Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote: > >> <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > >> > + > >> > +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size, > >> > + pid_t * __user pids) > >> > >> I don't see why cargs_size is needed for expansion if you have flags. > > > > I think it's cleaner this way. The alternative you seem to be hinting at > > is: > > > > If we used a flag bit to indicate an expansion of the parameters then it > > would only be able to specify one expansion before we'd have to start > > using bits in the args structure itself. Using those extra bits is > > quite gross -- we'd have to copy the initial portion of the struct, decode > > the bit(s) describing the size, and then copy the rest. Also, do we have > > any bits left in flags_low? I thought those were all used up... > > You'd be copying from a struct in userspace to some random local > variables in the kernel. There is no reason why the kernel would > have to use a struct at all. You copy the flags, then see what else > you need to copy. Exactly. The size argument is also my main criticism of the suggested syscall, and I've been arguing the same as you. Note that you may still use copy the entire struct, provided that we leave enough reserved fields at the end for future extensions. If we run out of space ten years from now, we can still have a new syscall number with a new structure. Arnd ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <201007051234.40943.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <201007051234.40943.arnd-r2nGTMty4D4@public.gmane.org> @ 2010-07-06 22:25 ` Sukadev Bhattiprolu [not found] ` <20100706222554.GA7648-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-06 22:25 UTC (permalink / raw) To: Arnd Bergmann Cc: Randy Dunlap, Serge Hallyn, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath Arnd Bergmann [arnd-r2nGTMty4D4@public.gmane.org] wrote: | On Monday 05 July 2010, Albert Cahalan wrote: | > On Sun, Jul 4, 2010 at 7:39 PM, Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: | > > On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote: | > >> <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: | > >> > + | > >> > +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size, | > >> > + pid_t * __user pids) | > >> | > >> I don't see why cargs_size is needed for expansion if you have flags. | > > | > > I think it's cleaner this way. The alternative you seem to be hinting at | > > is: | > > | > > If we used a flag bit to indicate an expansion of the parameters then it | > > would only be able to specify one expansion before we'd have to start | > > using bits in the args structure itself. Using those extra bits is | > > quite gross -- we'd have to copy the initial portion of the struct, decode | > > the bit(s) describing the size, and then copy the rest. Also, do we have | > > any bits left in flags_low? I thought those were all used up... I think there is one bit (0x1000 before CLONE_PTRACE) left in clone_flags now. Not sure if it had any historical uses, but there was talk of trying to using that flag to extend the functionality of clone() without a new system call. IIUC, the conclusion of the discussion was that such approach would make the API messy and set a bad precedent. And the extra copy-from-user was not considered signficant. IOW, Albert, we have been through this before in [v7] or [v10] of the API :-) | > | > You'd be copying from a struct in userspace to some random local | > variables in the kernel. There is no reason why the kernel would | > have to use a struct at all. You copy the flags, then see what else | > you need to copy. | | Exactly. The size argument is also my main criticism of the suggested | syscall, and I've been arguing the same as you. | | Note that you may still use copy the entire struct, provided that we | leave enough reserved fields at the end for future extensions. If | we run out of space ten years from now, we can still have a new syscall | number with a new structure. If we need the API to be extendible, to me specifying the size in the API seems more explicit than using the flags. Of course these extensions should not mean anything goes. It could be for something that does not change the basic function of the call (like adding a new clone flag). Anything significant than that should probably be a new system-call. ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <20100706222554.GA7648-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <20100706222554.GA7648-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2010-07-11 9:00 ` Albert Cahalan [not found] ` <AANLkTilJMi8cXSlbG8towQFFAQpuuJjO9kDXWOfEu_EJ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Albert Cahalan @ 2010-07-11 9:00 UTC (permalink / raw) To: Sukadev Bhattiprolu Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath On Tue, Jul 6, 2010 at 6:25 PM, Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > Arnd Bergmann [arnd-r2nGTMty4D4@public.gmane.org] wrote: > | On Monday 05 July 2010, Albert Cahalan wrote: > | > On Sun, Jul 4, 2010 at 7:39 PM, Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > | > > On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote: > | > >> <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > | > >> > + > | > >> > +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size, > | > >> > + pid_t * __user pids) > | > >> > | > >> I don't see why cargs_size is needed for expansion if you have flags. > | > > > | > > I think it's cleaner this way. The alternative you seem to be hinting at > | > > is: > | > > > | > > If we used a flag bit to indicate an expansion of the parameters then it > | > > would only be able to specify one expansion before we'd have to start > | > > using bits in the args structure itself. Using those extra bits is > | > > quite gross -- we'd have to copy the initial portion of the struct, decode > | > > the bit(s) describing the size, and then copy the rest. Also, do we have > | > > any bits left in flags_low? I thought those were all used up... > > I think there is one bit (0x1000 before CLONE_PTRACE) left in clone_flags now. > Not sure if it had any historical uses, but there was talk of trying to using > that flag to extend the functionality of clone() without a new system call. > > IIUC, the conclusion of the discussion was that such approach would make the > API messy and set a bad precedent. And the extra copy-from-user was not > considered signficant. > > IOW, Albert, we have been through this before in [v7] or [v10] of the API :-) I am not suggesting some foul abuse of the old clone syscall. I assume there will be a new syscall. Not that one couldn't cram things into the old system call, but that would involve changing the meaning of at least one parameter based on a flag. (eeew) I'm suggesting that you not copy the struct as one blob, or at least not expect to do so for future extensions to eclone. You can read the flags, use that to determine struct size, and then read the rest of the struct. Alternately you can pass 32 more flags as a 5th syscall argument. I'm not so sure we need 96 flag bits, but OK. They can all go in the struct if you like, or they can all go in the arguments. FWIW, I happen to think that both kernel and user code will be less ugly if all of the flags fit in 64 bits. C doesn't provide a 96-bit integer type. > | > You'd be copying from a struct in userspace to some random local > | > variables in the kernel. There is no reason why the kernel would > | > have to use a struct at all. You copy the flags, then see what else > | > you need to copy. > | > | Exactly. The size argument is also my main criticism of the suggested > | syscall, and I've been arguing the same as you. > | > | Note that you may still use copy the entire struct, provided that we > | leave enough reserved fields at the end for future extensions. If > | we run out of space ten years from now, we can still have a new syscall > | number with a new structure. > > If we need the API to be extendible, to me specifying the size in the > API seems more explicit than using the flags. Is there any other system call with this sort of extendability? You're going against history and IIRC policy. Most explicit would be a version number, but even that is forbidden AFAIK. ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <AANLkTilJMi8cXSlbG8towQFFAQpuuJjO9kDXWOfEu_EJ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <AANLkTilJMi8cXSlbG8towQFFAQpuuJjO9kDXWOfEu_EJ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-07-12 18:02 ` Matt Helsley 2010-07-12 21:54 ` Sukadev Bhattiprolu 1 sibling, 0 replies; 31+ messages in thread From: Matt Helsley @ 2010-07-12 18:02 UTC (permalink / raw) To: Albert Cahalan Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Sukadev Bhattiprolu, Roland McGrath On Sun, Jul 11, 2010 at 05:00:10AM -0400, Albert Cahalan wrote: > On Tue, Jul 6, 2010 at 6:25 PM, Sukadev Bhattiprolu > <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > > Arnd Bergmann [arnd-r2nGTMty4D4@public.gmane.org] wrote: > > | On Monday 05 July 2010, Albert Cahalan wrote: > > | > On Sun, Jul 4, 2010 at 7:39 PM, Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote: > > | > > On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote: > > | > >> <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: <snip> > > | > You'd be copying from a struct in userspace to some random local > > | > variables in the kernel. There is no reason why the kernel would > > | > have to use a struct at all. You copy the flags, then see what else > > | > you need to copy. > > | > > | Exactly. The size argument is also my main criticism of the suggested > > | syscall, and I've been arguing the same as you. > > | > > | Note that you may still use copy the entire struct, provided that we > > | leave enough reserved fields at the end for future extensions. If > > | we run out of space ten years from now, we can still have a new syscall > > | number with a new structure. > > > > If we need the API to be extendible, to me specifying the size in the > > API seems more explicit than using the flags. > > Is there any other system call with this sort of extendability? > You're going against history and IIRC policy. > > Most explicit would be a version number, but even that is > forbidden AFAIK. Have a look at set_robust_list() -- it passes a size parameter for a struct. Lots of socket calls take a socklen_t for a variable-size address struct. (Granted, they all go through socketcall, but that's a bit beside the point.) read() and write() syscalls operating on pseudofiles can be for structs (e.g. signalfd_siginfo) and thus indirectly do this. Cheers, -Matt Helsley ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <AANLkTilJMi8cXSlbG8towQFFAQpuuJjO9kDXWOfEu_EJ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2010-07-12 18:02 ` Matt Helsley @ 2010-07-12 21:54 ` Sukadev Bhattiprolu [not found] ` <20100712215456.GA23721-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-12 21:54 UTC (permalink / raw) To: Albert Cahalan Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath Albert Cahalan [acahalan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org] wrote: | Not that one couldn't cram | things into the old system call, but that would involve changing | the meaning of at least one parameter based on a flag. (eeew) My understanding was that extending eclone() in the future using a flag to determine the size would get the same response. Maybe I misunderstood that, but when Peter proposed that we use the 'args_size' - it looked logical to me. Sure such interfaces are not common, but I did not see any response/objections to it. http://lkml.org/lkml/2009/10/14/497 | | I'm suggesting that you not copy the struct as one blob, or at | least not expect to do so for future extensions to eclone. You | can read the flags, use that to determine struct size, and then | read the rest of the struct. Alternately you can pass 32 more flags | as a 5th syscall argument. | | I'm not so sure we need 96 flag bits, but OK. They can all go | in the struct if you like, or they can all go in the arguments. | FWIW, I happen to think that both kernel and user code will | be less ugly if all of the flags fit in 64 bits. C doesn't provide | a 96-bit integer type. We wanted to leave the original 32-bits of clone-flags as the first parameter to avoid confusing the application writers. http://lkml.org/lkml/2009/10/14/13 We do not anticipate needing 64 more flags - we may have several independent calls to unshare/clone new namespaces. We can treat the higher 32-bits as a "reserved" field for now and not worry about using all 96-bits for now. Sukadev ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <20100712215456.GA23721-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <20100712215456.GA23721-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2010-07-13 6:48 ` Albert Cahalan [not found] ` <AANLkTikXDsv9CoV_EU48bunLD1wCh3w7HVdpo84rJtJg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Albert Cahalan @ 2010-07-13 6:48 UTC (permalink / raw) To: Sukadev Bhattiprolu Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath On Mon, Jul 12, 2010 at 5:54 PM, Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > Albert Cahalan [acahalan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org] wrote: > > | Not that one couldn't cram > | things into the old system call, but that would involve changing > | the meaning of at least one parameter based on a flag. (eeew) > > My understanding was that extending eclone() in the future using a flag > to determine the size would get the same response. I doubt that. We extended clone via flags without trouble, increasing the number of parameters it took. We never had to change the meaning of a parameter though. Unfortunately we used up the last (sixth) free system call parameter. There is no way to cram in a pointer to a parameter struct unless we redefine one of the six existing parameters, which is yucky. For example, the stack pointer parameter could instead point at a struct if some "#define CLONE_EXT 0x80000000" is set. We've done worse I think, (mount?), but... So the distinction here is adding vs. redefining. You wouldn't need to redefine anything, since RAM space is plentiful. The old clone syscall can only be extended by redefining, because we've run out of registers. I must admit that the hack looks strangely appealing though, eliminating the need to wait for glibc or find alternatives. :-) > | I'm suggesting that you not copy the struct as one blob, or at > | least not expect to do so for future extensions to eclone. You > | can read the flags, use that to determine struct size, and then > | read the rest of the struct. Alternately you can pass 32 more flags > | as a 5th syscall argument. > | > | I'm not so sure we need 96 flag bits, but OK. They can all go > | in the struct if you like, or they can all go in the arguments. > | FWIW, I happen to think that both kernel and user code will > | be less ugly if all of the flags fit in 64 bits. C doesn't provide > | a 96-bit integer type. > > We wanted to leave the original 32-bits of clone-flags as the first > parameter to avoid confusing the application writers. > > http://lkml.org/lkml/2009/10/14/13 IMHO, having multiple flags fields is far more confusing. It's easier to document the high flag case ("This flag requires the eclone system call.") than to document multiple sets of incompatible flags for one system call. I always curse the nasty multi-flagged mmap interface whenever I use it. If you're going to be concerned with flags getting truncated via improper use of the syscall, then you should be at least as concerned with flags getting passed in the wrong place. For example, one might pass all flags (high and low) in the struct. Somebody else might pass all flags in the parameters. ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <AANLkTikXDsv9CoV_EU48bunLD1wCh3w7HVdpo84rJtJg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <AANLkTikXDsv9CoV_EU48bunLD1wCh3w7HVdpo84rJtJg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-07-20 22:13 ` Sukadev Bhattiprolu [not found] ` <20100720221357.GA2440-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Sukadev Bhattiprolu @ 2010-07-20 22:13 UTC (permalink / raw) To: Albert Cahalan Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath Albert Cahalan [acahalan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org] wrote: | On Mon, Jul 12, 2010 at 5:54 PM, Sukadev Bhattiprolu | <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: | > Albert Cahalan [acahalan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org] wrote: | > | > | Not that one couldn't cram | > | things into the old system call, but that would involve changing | > | the meaning of at least one parameter based on a flag. (eeew) | > | > My understanding was that extending eclone() in the future using a flag | > to determine the size would get the same response. | | I doubt that. We extended clone via flags without trouble, | increasing the number of parameters it took. We never had | to change the meaning of a parameter though. Unfortunately | we used up the last (sixth) free system call parameter. We only use 5 args for clone() right - so the 6th (%ebp) is still available for extension ? sys_clone(flags, child_stack, parent_tid, tls, child_tid) The problem we have is that we needed two more parameters, pid_t *pids and 'struct clone_args' - again this was discussed in [v5] or [v6] of the patchset. But regardless, I think in the long run, it would be cleaner if we created a new system call that is common across all architectures. | | > | I'm suggesting that you not copy the struct as one blob, or at | > | least not expect to do so for future extensions to eclone. You | > | can read the flags, use that to determine struct size, and then | > | read the rest of the struct. Alternately you can pass 32 more flags | > | as a 5th syscall argument. | > | | > | I'm not so sure we need 96 flag bits, but OK. They can all go | > | in the struct if you like, or they can all go in the arguments. | > | FWIW, I happen to think that both kernel and user code will | > | be less ugly if all of the flags fit in 64 bits. C doesn't provide | > | a 96-bit integer type. | > | > We wanted to leave the original 32-bits of clone-flags as the first | > parameter to avoid confusing the application writers. | > | > http://lkml.org/lkml/2009/10/14/13 | | IMHO, having multiple flags fields is far more confusing. It is confusing but like I said, it was discussed (see http://lkml.org/lkml/2009/10/14/6) and the conclusion was that silent data corruption is worse. So, I am leaving the interface as defined in this version of the patchset. Arnd, Peter, Roland, do you have any thoughts on Albert's comments ? Maybe we can define other API to set/clear clone flags like the sigset() interface when we need more than 32 bit flags. | It's easier to document the high flag case ("This flag requires | the eclone system call."). Hmm, with the current API, all newer flags will go into ->clone_flags_high. so we would have to say something like: CLONE_NEWFOO requires eclone() and must be set in ->clone_flags_high field. | than to document multiple sets | of incompatible flags for one system call. I always curse | the nasty multi-flagged mmap interface whenever I use it. | If you're going to be concerned with flags getting truncated | via improper use of the syscall, then you should be at least | as concerned with flags getting passed in the wrong place. | For example, one might pass all flags (high and low) in the | struct. Somebody else might pass all flags in the parameters. Maybe the sigset() like interface will help. Sukadev ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <20100720221357.GA2440-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <20100720221357.GA2440-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2010-07-21 19:51 ` Albert Cahalan 0 siblings, 0 replies; 31+ messages in thread From: Albert Cahalan @ 2010-07-21 19:51 UTC (permalink / raw) To: Sukadev Bhattiprolu Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Roland McGrath On Tue, Jul 20, 2010 at 6:13 PM, Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > Albert Cahalan [acahalan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org] wrote: > | On Mon, Jul 12, 2010 at 5:54 PM, Sukadev Bhattiprolu > | <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > | > Albert Cahalan [acahalan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org] wrote: > | > | I'm suggesting that you not copy the struct as one blob, or at > | > | least not expect to do so for future extensions to eclone. You > | > | can read the flags, use that to determine struct size, and then > | > | read the rest of the struct. Alternately you can pass 32 more flags > | > | as a 5th syscall argument. > | > | > | > | I'm not so sure we need 96 flag bits, but OK. They can all go > | > | in the struct if you like, or they can all go in the arguments. > | > | FWIW, I happen to think that both kernel and user code will > | > | be less ugly if all of the flags fit in 64 bits. C doesn't provide > | > | a 96-bit integer type. > | > > | > We wanted to leave the original 32-bits of clone-flags as the first > | > parameter to avoid confusing the application writers. > | > > | > http://lkml.org/lkml/2009/10/14/13 > | > | IMHO, having multiple flags fields is far more confusing. > > It is confusing but like I said, it was discussed (see > http://lkml.org/lkml/2009/10/14/6) and the conclusion was that silent > data corruption is worse. So, I am leaving the interface as defined in > this version of the patchset. The current interface encourages silent data corruption of a different type. It's far more likely to happen I think; putting flags in the wrong field is what people will do. Either way, this is a trivial problem for freshly-written code. I'd rather the problem affect an obsolete interface, keeping the new interface nice and clean with a single 64-bit value. > Maybe we can define other API to set/clear clone flags like the sigset() > interface when we need more than 32 bit flags. ... > Maybe the sigset() like interface will help. That works, but it sure is a pain to use. BTW, the flags used to give clone extra arguments are no longer needed and are thus available for reuse. This wouldn't be confusing if there were ECLONE_* defines. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <20100704233951.GK3338-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org> 2010-07-05 0:45 ` Albert Cahalan @ 2010-07-05 4:10 ` H. Peter Anvin [not found] ` <4C315B42.1020201-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org> 2010-07-05 4:18 ` Oren Laadan 2 siblings, 1 reply; 31+ messages in thread From: H. Peter Anvin @ 2010-07-05 4:10 UTC (permalink / raw) To: Matt Helsley Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, Dan Smith, Sukadev Bhattiprolu, Roland McGrath On 07/04/2010 04:39 PM, Matt Helsley wrote: >> >> 1. can you implement it for i386 (register starved) using eclone? > > That's a very good question. I'm going to punt on a direct answer for > now. Instead, I wonder if it's even worth enabling vfork through eclone. > vfork is rarely used, is supported by the "old" clone syscall, and any > old code adapted to use eclone for vfork would need significant > changes because of vfork's specialness. (A consequence of the way vfork > borrows page tables and must avoid clobbering parent's registers..) > vfork is its own system call for a reason. We used to do it with sys_clone, and it turned out to be a mess. Doing it in a separate system call -- even though the internals are largely the same -- is cleaner. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <4C315B42.1020201-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <4C315B42.1020201-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org> @ 2010-07-06 4:11 ` Albert Cahalan [not found] ` <AANLkTimWBFbcRyf5tvA9Lork13gAJtCAdUg_ZS3PzbI0-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Albert Cahalan @ 2010-07-06 4:11 UTC (permalink / raw) To: H. Peter Anvin Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, Dan Smith, Sukadev Bhattiprolu, Roland McGrath On Mon, Jul 5, 2010 at 12:10 AM, H. Peter Anvin <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org> wrote: > On 07/04/2010 04:39 PM, Matt Helsley wrote: >>> >>> 1. can you implement it for i386 (register starved) using eclone? >> >> That's a very good question. I'm going to punt on a direct answer for >> now. Instead, I wonder if it's even worth enabling vfork through eclone. >> vfork is rarely used, is supported by the "old" clone syscall, and any >> old code adapted to use eclone for vfork would need significant >> changes because of vfork's specialness. (A consequence of the way vfork >> borrows page tables and must avoid clobbering parent's registers..) > > vfork is its own system call for a reason. We used to do it with > sys_clone, and it turned out to be a mess. Doing it in a separate > system call -- even though the internals are largely the same -- is cleaner. That's interesting; only ia64 and xtensa still do vfork via clone. I remember sys_vfork being purely i386. I guess we need an evfork then. We're slowly gaining all sorts of functionality (various clone flags) that is inaccessible when there is a need for vfork behavior. To greatly reduce the mess, I strongly suggest that it have the int(*func)(void*arg) and void*arg arguments like BSD's rfork_thread. Note that this mess reduction would be severe enough to solve the problems with sharing a syscall. ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <AANLkTimWBFbcRyf5tvA9Lork13gAJtCAdUg_ZS3PzbI0-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <AANLkTimWBFbcRyf5tvA9Lork13gAJtCAdUg_ZS3PzbI0-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-07-06 15:14 ` Oren Laadan 0 siblings, 0 replies; 31+ messages in thread From: Oren Laadan @ 2010-07-06 15:14 UTC (permalink / raw) To: Albert Cahalan Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Sukadev Bhattiprolu, Roland McGrath Albert Cahalan wrote: > On Mon, Jul 5, 2010 at 12:10 AM, H. Peter Anvin <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org> wrote: >> On 07/04/2010 04:39 PM, Matt Helsley wrote: >>>> 1. can you implement it for i386 (register starved) using eclone? >>> That's a very good question. I'm going to punt on a direct answer for >>> now. Instead, I wonder if it's even worth enabling vfork through eclone. >>> vfork is rarely used, is supported by the "old" clone syscall, and any >>> old code adapted to use eclone for vfork would need significant >>> changes because of vfork's specialness. (A consequence of the way vfork >>> borrows page tables and must avoid clobbering parent's registers..) >> vfork is its own system call for a reason. We used to do it with >> sys_clone, and it turned out to be a mess. Doing it in a separate >> system call -- even though the internals are largely the same -- is cleaner. > > That's interesting; only ia64 and xtensa still do vfork via clone. > I remember sys_vfork being purely i386. > > I guess we need an evfork then. We're slowly gaining all sorts > of functionality (various clone flags) that is inaccessible when > there is a need for vfork behavior. For the record, as far as I can tell, we don't need evfork() for checkpoint-restart. So you'd need to come up with other use cases. Oren. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <20100704233951.GK3338-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org> 2010-07-05 0:45 ` Albert Cahalan 2010-07-05 4:10 ` H. Peter Anvin @ 2010-07-05 4:18 ` Oren Laadan [not found] ` <4C315D2D.6000107-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 2 siblings, 1 reply; 31+ messages in thread From: Oren Laadan @ 2010-07-05 4:18 UTC (permalink / raw) To: Matt Helsley Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Sukadev Bhattiprolu, Roland McGrath Matt Helsley wrote: > On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote: >> On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu >> <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: >> [...] >>> + The order of pids in @pids is oldest in pids[0] to youngest pid >>> + namespace in pids[nr_pids-1]. If the number of pids specified in the >>> + @pids list is fewer than the nesting level of the process, the pids >>> + are applied from youngest namespace. I.e if the process is nested in >>> + a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids >>> + are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to >>> + have a pid of '0' (the kernel will assign a pid in those namespaces). >> That feels backwards. I'd have guessed pids[0] is how the >> process sees itself. You'd truncate the array to reduce nesting >> level rather than pointing into it. > > The only difference between what you ask for and what eclone does is the > order of the pids. The array is truncatable as you requested. Knowing nr_pids > is sufficient -- there's no need for something "pointing into it". > > ISTR it was ordered this way to avoid odd-looking loops or extra copying in > the kernel pid code. Suka may recall the reasoning better than I do -- I'd > have to dig through list archives to be certain. This is what I remember, too: the idea was to order the pid's in the order that the kernel - and (many of ?) us humans - view of them: top-down. IOW, this corresponds to the hierarchical nature of the pid-namespace hierarchy: ancestors come first, followed by descendants. > >>> + On failure, eclone() returns -1 and sets 'errno' to one of following >>> + values (the child process is not created). >> Careful here: do you intend to document the system call itself, >> or an expected glibc wrapper that doesn't exist yet? >> >>> + EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to >>> + specify the pids in this call (if pids are not specifed >>> + CAP_SYS_ADMIN is not required). >> It seems appropriate to let PID 1 in any PID namespace be >> able to assign PIDs in it's own namespace and in any >> child namespaces. > > I disagree. The way you describe it, more than one "pid 1" could be > involved thus the pid assignment could conflict. Especially in the case > of container checkpoint/restart where one or more of those "pid 1" tasks > is not aware that it's being restarted. It gets even worse if you don't > assume that the same container software is being used in nested > containers. I assume that by "any child namespace" you mean "any descendant namespace" ? I'm with Matt here. In particular, I make the distinction between where a process "lives" and where it is "visible". A process is visible in all the pid-namespaces up its ancestry. However, it "lives" only in the bottom-most pid-namespace. For example, if it calls kill(2), it will affect another process in _that_ namespace. It follows that trying to set pid's in pid-namespaces _below_ you simply doesn't make sense (beyond the CLONE_NEWPID case). Finally, there have been objections before to allow pid-selection by non-privileged process. If such functionality is deemed desired later on, it can be easily added. However, let's separate that discussion from the current discussion. [...] >>> +static int do_clone(int (*child_fn)(void *), void *child_arg, >>> + unsigned int flags_low, int nr_pids, pid_t *pids_list) >> There needs to be a way to pass child_fn and child_arg >> via the kernel. Besides being required for kernel-managed >> stacks, it's normally a saner interface. Stack setup would >> be much like the stack setup for signal handlers. Imagine > > I'm inclined to say this is a bad idea. > > I didn't think we had "kernel-managed stacks" in mainline. The most we > have, to my knowledge, is the sigaltstack support and kernel threads. > > I don't see how being able to pass in child_fn and child_arg to the > kernel improves the sanity of the interface. If anything it will make > eclone even more exotic -- now at the end of the syscall we'll > need to mess with the registers/stack of the child much like when we're > invoking a signal handler. That just adds more arch-specific code than is > necessary. > > Userspace wrappers are perfectly capable of invoking the child function > and passing the arguments. Furthermore, passing those arguments requires > expanding the argument structure or putting even greater pressure on > registers (which, as you pointed out below, is an issue for vfork). > >> using this for a vfork-like interface that didn't have painful >> interactions with the compiler. Pardon my ignorance - what sort of painful interactions ? >> >> Speaking of vfork.... >> >> 1. can you implement it for i386 (register starved) using eclone? > > That's a very good question. I'm going to punt on a direct answer for > now. Instead, I wonder if it's even worth enabling vfork through eclone. > vfork is rarely used, is supported by the "old" clone syscall, and any > old code adapted to use eclone for vfork would need significant > changes because of vfork's specialness. (A consequence of the way vfork > borrows page tables and must avoid clobbering parent's registers..) > > So while vfork support in eclone might be "nice" to have I don't think > it's strictly necessary. > >> 2. can you restart a pair of processes between vfork and execve? > > Another good question. > > No, we do not support that. In fact, the man page suggests vfork children > are not supposed to make syscalls other than execve, kill(SIGABRT), or > "_exit". To me that clearly justifies not being able to call restart from > a vfork child. Indeed good question. Let me add: "we do not support YET". It is not too complicated add support to vfork - it merely requires that we emulate the behavior of both parent and child once they have completed to restore their state, to ensure that the parent waits for the child's completion. (BTW, similar logic will be required to support checkpoint and especially restart of ptraced processes). It's just that it will require some (minor, I believe) edits to the vfork code - which I want to avoid as long as we are awaiting more review of the current code. Oren. ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <4C315D2D.6000107-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <4C315D2D.6000107-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2010-07-06 3:59 ` Albert Cahalan [not found] ` <AANLkTinr_2u-_0S2UvMDc7hOE_JOVIOjGtVo9Tzuk21E-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 31+ messages in thread From: Albert Cahalan @ 2010-07-06 3:59 UTC (permalink / raw) To: Oren Laadan Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Sukadev Bhattiprolu, Roland McGrath On Mon, Jul 5, 2010 at 12:18 AM, Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> wrote: > Matt Helsley wrote: >> On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote: >>> On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu >>> <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > It follows that trying to set pid's in pid-namespaces _below_ you > simply doesn't make sense (beyond the CLONE_NEWPID case). I may have some wrong ideas about how process restart works, but I'd thought it would normally be done from above or from PID 1 in the same pid namespace. > Finally, there have been objections before to allow pid-selection > by non-privileged process. Eh, I dearly hope that privileged processes are generally not even addressable (never mind creatable or accessable) from inside anything other than the top-level pid namespace. Well, at least nothing should get more privilege than PID 1. This would include having UID values that PID 1 can switch to and having capability sets that PID 1 can switch to, and any other (SE Linux, AppArmor, etc.) things too. Restarting a privileged process with a less privileged PID 1 should result in privilege loss, and ought to require some sort of "--force" option to ensure the person accepts possible breakage. >>>> +static int do_clone(int (*child_fn)(void *), void *child_arg, >>>> + unsigned int flags_low, int nr_pids, pid_t *pids_list) >>> >>> There needs to be a way to pass child_fn and child_arg >>> via the kernel. Besides being required for kernel-managed >>> stacks, it's normally a saner interface. Stack setup would >>> be much like the stack setup for signal handlers. Imagine >> >> I'm inclined to say this is a bad idea. >> >> I didn't think we had "kernel-managed stacks" in mainline. The most we >> have, to my knowledge, is the sigaltstack support and kernel threads. >> >> I don't see how being able to pass in child_fn and child_arg to the >> kernel improves the sanity of the interface. If anything it will make >> eclone even more exotic -- now at the end of the syscall we'll >> need to mess with the registers/stack of the child much like when we're >> invoking a signal handler. That just adds more arch-specific code than is >> necessary. >> >> Userspace wrappers are perfectly capable of invoking the child function >> and passing the arguments. Furthermore, passing those arguments requires >> expanding the argument structure or putting even greater pressure on >> registers (which, as you pointed out below, is an issue for vfork). BSD's rfork_thread has, among other things, these two arguments: int (*func)(void *arg) void *arg >>> using this for a vfork-like interface that didn't have painful >>> interactions with the compiler. > > Pardon my ignorance - what sort of painful interactions ? The child returns from vfork, via the same return address that the parent will later use. (on the stack for many architectures) The child then calls a function which might not have the same stack layout as vfork, scrambling whatever may be on the stack that the parent will be using to return from vfork. The parent may then end up using a return address that has been corrupted. To make this work, gcc actually recognizes vfork and has special handling for it. ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <AANLkTinr_2u-_0S2UvMDc7hOE_JOVIOjGtVo9Tzuk21E-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <AANLkTinr_2u-_0S2UvMDc7hOE_JOVIOjGtVo9Tzuk21E-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-07-06 13:12 ` Serge E. Hallyn 2010-07-06 15:12 ` Oren Laadan 1 sibling, 0 replies; 31+ messages in thread From: Serge E. Hallyn @ 2010-07-06 13:12 UTC (permalink / raw) To: Albert Cahalan Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Sukadev Bhattiprolu, Roland McGrath Quoting Albert Cahalan (acahalan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org): > On Mon, Jul 5, 2010 at 12:18 AM, Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> wrote: > > Finally, there have been objections before to allow pid-selection > > by non-privileged process. > > Eh, I dearly hope that privileged processes are generally not > even addressable (never mind creatable or accessable) from > inside anything other than the top-level pid namespace. If a privileged task was created in the top-level pid namespace, then it is not addressable from inside a descendent pid namespace. > Well, at least nothing should get more privilege than PID 1. > This would include having UID values that PID 1 can switch > to and having capability sets that PID 1 can switch to, and > any other (SE Linux, AppArmor, etc.) things too. IIUC the spirit of what you say here is what is intended by the completion of the user namespaces. They'll ensure that things like setuid-root and file capabilities limit privilege to resources owned by the task which created the namespace. That's why unprivileged pid ns unsharing won't be considered until user namespaces are completed. > Restarting a privileged process with a less privileged PID 1 > should result in privilege loss, and ought to require some sort of > "--force" option to ensure the person accepts possible breakage. Interesting point. Do we allow ptrace of a container init? -serge ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <AANLkTinr_2u-_0S2UvMDc7hOE_JOVIOjGtVo9Tzuk21E-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2010-07-06 13:12 ` Serge E. Hallyn @ 2010-07-06 15:12 ` Oren Laadan [not found] ` <4C3347CA.8060703-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 1 sibling, 1 reply; 31+ messages in thread From: Oren Laadan @ 2010-07-06 15:12 UTC (permalink / raw) To: Albert Cahalan Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Containers, Nathan Lynch, H. Peter Anvin, Dan Smith, Sukadev Bhattiprolu, Roland McGrath Albert Cahalan wrote: > On Mon, Jul 5, 2010 at 12:18 AM, Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> wrote: >> Matt Helsley wrote: >>> On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote: >>>> On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu >>>> <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote: > >> It follows that trying to set pid's in pid-namespaces _below_ you >> simply doesn't make sense (beyond the CLONE_NEWPID case). > > I may have some wrong ideas about how process restart works, > but I'd thought it would normally be done from above or from PID 1 > in the same pid namespace. > >> Finally, there have been objections before to allow pid-selection >> by non-privileged process. > > Eh, I dearly hope that privileged processes are generally not > even addressable (never mind creatable or accessable) from > inside anything other than the top-level pid namespace. > > Well, at least nothing should get more privilege than PID 1. > This would include having UID values that PID 1 can switch > to and having capability sets that PID 1 can switch to, and > any other (SE Linux, AppArmor, etc.) things too. > > Restarting a privileged process with a less privileged PID 1 > should result in privilege loss, and ought to require some sort of > "--force" option to ensure the person accepts possible breakage. > >>>>> +static int do_clone(int (*child_fn)(void *), void *child_arg, >>>>> + unsigned int flags_low, int nr_pids, pid_t *pids_list) >>>> There needs to be a way to pass child_fn and child_arg >>>> via the kernel. Besides being required for kernel-managed >>>> stacks, it's normally a saner interface. Stack setup would >>>> be much like the stack setup for signal handlers. Imagine >>> I'm inclined to say this is a bad idea. >>> >>> I didn't think we had "kernel-managed stacks" in mainline. The most we >>> have, to my knowledge, is the sigaltstack support and kernel threads. >>> >>> I don't see how being able to pass in child_fn and child_arg to the >>> kernel improves the sanity of the interface. If anything it will make >>> eclone even more exotic -- now at the end of the syscall we'll >>> need to mess with the registers/stack of the child much like when we're >>> invoking a signal handler. That just adds more arch-specific code than is >>> necessary. >>> >>> Userspace wrappers are perfectly capable of invoking the child function >>> and passing the arguments. Furthermore, passing those arguments requires >>> expanding the argument structure or putting even greater pressure on >>> registers (which, as you pointed out below, is an issue for vfork). > > BSD's rfork_thread has, among other things, these two arguments: > > int (*func)(void *arg) > void *arg > >>>> using this for a vfork-like interface that didn't have painful >>>> interactions with the compiler. >> Pardon my ignorance - what sort of painful interactions ? > > The child returns from vfork, via the same return address that > the parent will later use. (on the stack for many architectures) > The child then calls a function which might not have the same > stack layout as vfork, scrambling whatever may be on the stack > that the parent will be using to return from vfork. The parent may > then end up using a return address that has been corrupted. > To make this work, gcc actually recognizes vfork and has > special handling for it. I assumed that this is taken care of by libc rather than the compiler, like it is done for clone(2). Oren. ^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <4C3347CA.8060703-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [PATCH 11/11][v15]: Document sys_eclone [not found] ` <4C3347CA.8060703-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2010-07-06 22:23 ` H. Peter Anvin 0 siblings, 0 replies; 31+ messages in thread From: H. Peter Anvin @ 2010-07-06 22:23 UTC (permalink / raw) To: Oren Laadan Cc: Randy Dunlap, Serge Hallyn, Arnd Bergmann, Albert Cahalan, Containers, Nathan Lynch, Dan Smith, Sukadev Bhattiprolu, Roland McGrath On 07/06/2010 08:12 AM, Oren Laadan wrote: >> >> The child returns from vfork, via the same return address that >> the parent will later use. (on the stack for many architectures) >> The child then calls a function which might not have the same >> stack layout as vfork, scrambling whatever may be on the stack >> that the parent will be using to return from vfork. The parent may >> then end up using a return address that has been corrupted. >> To make this work, gcc actually recognizes vfork and has >> special handling for it. > > I assumed that this is taken care of by libc rather than the > compiler, like it is done for clone(2). > No, vfork is *really* special, because the two threads share a stack. -hpa ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2010-07-21 19:51 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-07-03 20:32 [PATCH 00/11][v15]: Implement eclone() system call Sukadev Bhattiprolu
[not found] ` <1278189164-28408-1-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-03 20:32 ` [PATCH 01/11][v15]: Factor out code to allocate pidmap page Sukadev Bhattiprolu
2010-07-03 20:32 ` [PATCH 02/11][v15]: Have alloc_pidmap() return actual error code Sukadev Bhattiprolu
2010-07-03 20:32 ` [PATCH 03/11][v15]: Define set_pidmap() function Sukadev Bhattiprolu
2010-07-03 20:32 ` [PATCH 04/11][v15]: Add target_pids parameter to alloc_pid() Sukadev Bhattiprolu
2010-07-03 20:32 ` [PATCH 05/11][v15]: Add target_pids parameter to copy_process() Sukadev Bhattiprolu
2010-07-03 20:32 ` [PATCH 06/11][v15]: Check invalid clone flags Sukadev Bhattiprolu
2010-07-03 20:32 ` [PATCH 07/11][v15]: Define do_fork_with_pids() Sukadev Bhattiprolu
2010-07-03 20:32 ` [PATCH 08/11][v15]: Implement sys_eclone for x86 (32,64) Sukadev Bhattiprolu
2010-07-03 20:32 ` [PATCH 09/11][v15]: Implement sys_eclone for s390 Sukadev Bhattiprolu
2010-07-03 20:32 ` [PATCH 10/11][v15]: Implement sys_eclone for powerpc Sukadev Bhattiprolu
2010-07-03 20:32 ` [PATCH 11/11][v15]: Document sys_eclone Sukadev Bhattiprolu
[not found] ` <1278189164-28408-12-git-send-email-sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2010-07-03 23:41 ` Albert Cahalan
[not found] ` <AANLkTinM1jqG-9Mgbzft8bALGri7ZpzU9ZcPbMTe4fvW-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-04 23:39 ` Matt Helsley
[not found] ` <20100704233951.GK3338-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-07-05 0:45 ` Albert Cahalan
[not found] ` <AANLkTilTGvFTkjy8vi8N8msB7koEp0r7SnpPqJkVN4XA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-05 10:34 ` Arnd Bergmann
[not found] ` <201007051234.40943.arnd-r2nGTMty4D4@public.gmane.org>
2010-07-06 22:25 ` Sukadev Bhattiprolu
[not found] ` <20100706222554.GA7648-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-07-11 9:00 ` Albert Cahalan
[not found] ` <AANLkTilJMi8cXSlbG8towQFFAQpuuJjO9kDXWOfEu_EJ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-12 18:02 ` Matt Helsley
2010-07-12 21:54 ` Sukadev Bhattiprolu
[not found] ` <20100712215456.GA23721-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-07-13 6:48 ` Albert Cahalan
[not found] ` <AANLkTikXDsv9CoV_EU48bunLD1wCh3w7HVdpo84rJtJg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-20 22:13 ` Sukadev Bhattiprolu
[not found] ` <20100720221357.GA2440-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-07-21 19:51 ` Albert Cahalan
2010-07-05 4:10 ` H. Peter Anvin
[not found] ` <4C315B42.1020201-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>
2010-07-06 4:11 ` Albert Cahalan
[not found] ` <AANLkTimWBFbcRyf5tvA9Lork13gAJtCAdUg_ZS3PzbI0-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-06 15:14 ` Oren Laadan
2010-07-05 4:18 ` Oren Laadan
[not found] ` <4C315D2D.6000107-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-07-06 3:59 ` Albert Cahalan
[not found] ` <AANLkTinr_2u-_0S2UvMDc7hOE_JOVIOjGtVo9Tzuk21E-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-06 13:12 ` Serge E. Hallyn
2010-07-06 15:12 ` Oren Laadan
[not found] ` <4C3347CA.8060703-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-07-06 22:23 ` H. Peter Anvin
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.