* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
@ 2010-05-29 10:31 Albert Cahalan
2010-06-01 19:32 ` Sukadev Bhattiprolu
0 siblings, 1 reply; 14+ messages in thread
From: Albert Cahalan @ 2010-05-29 10:31 UTC (permalink / raw)
To: linux-kernel, sukadev, randy.dunlap, linuxppc-dev
Sukadev Bhattiprolu writes:
> Randy Dunlap [randy.dunlap at oracle.com] wrote:
>>> base of the region allocated for stack. These architectures
>>> must pass in the size of the stack-region in ->child_stack_size.
>>
>> stack region
>>
>> Seems unfortunate that different architectures use
>> the fields differently.
>
> Yes and no. The field still has a single purpose, just that
> some architectures may not need it. We enforce that if unused
> on an architecture, the field must be 0. It looked like
> the easiest way to keep the API common across architectures.
Yuck. You're forcing userspace to have #ifdef messes or,
more likely, just not work on all architectures. There is
no reason to have field usage vary by architecture. The
original clone syscall was not designed with ia64 and hppa
in mind, and has been causing trouble ever since. Let's not
perpetuate the problem.
Given code like this: stack_base = malloc(stack_size);
stack_base and stack_size are what the kernel needs.
I suspect that you chose the defective method for some reason
related to restarting processes that were created with the
older system calls. I can't say most of us even care, but in
that broken-already case your process restarter can make up
some numbers that will work. (for i386, the base could be the
lowest address in the vma in which %esp lies, or even address 0)
A related issue is that stack allocation and deallocation can
be quite painful: it is difficult (some assembly required) to
free one's own stack, and impossible if one is already dead.
We could use a flag to let the kernel handle allocation, with
the stack getting freed just after any ptracer gets a last look.
This issue is especially troublesome for me because the syscall
essentially requires per-thread memory to work; it is currently
extremely difficult to use the syscall in code which lacks that.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
2010-05-29 10:31 [PATCH v21 011/100] eclone (11/11): Document sys_eclone Albert Cahalan
@ 2010-06-01 19:32 ` Sukadev Bhattiprolu
2010-06-01 19:59 ` Albert Cahalan
0 siblings, 1 reply; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2010-06-01 19:32 UTC (permalink / raw)
To: Albert Cahalan; +Cc: linux-kernel, randy.dunlap, linuxppc-dev
Albert Cahalan [acahalan@gmail.com] wrote:
| Sukadev Bhattiprolu writes:
|
| > Randy Dunlap [randy.dunlap at oracle.com] wrote:
| >>> base of the region allocated for stack. These architectures
| >>> must pass in the size of the stack-region in ->child_stack_size.
| >>
| >> stack region
| >>
| >> Seems unfortunate that different architectures use
| >> the fields differently.
| >
| > Yes and no. The field still has a single purpose, just that
| > some architectures may not need it. We enforce that if unused
| > on an architecture, the field must be 0. It looked like
| > the easiest way to keep the API common across architectures.
|
| Yuck. You're forcing userspace to have #ifdef messes or,
| more likely, just not work on all architectures.
There is going to be #ifdef code in the library interface to eclone().
But applications should not need any #ifdefs. Please see the test cases
for eclone in
git://git.sr71.net/~hallyn/cr_tests.git
There is no #ifdef and the tests work on x86, x86_64, ppc, s390.
These use the libeclone.a built from following git-tree, which has the
arch-dependent user space code.
git://git.ncl.cs.columbia.edu/pub/git/user-cr.git
Is that the #ifdef mess you are talking about ? I don't see that as a
consequence of the API. So maybe you can elaborate.
| There is no reason to have field usage vary by architecture. The
The field usage does not vary by architecture. Some architectures
don't use some fields and those fields must be 0. A simple
memset(&clone_args, 0, sizeof(clone_args))
before initializing fields is all that is required.
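A minimal sketch of that zero-then-fill pattern, assuming a userspace copy of struct clone_args with the layout recapped later in this thread. init_clone_args() is a hypothetical helper for illustration, not something taken from libeclone:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the proposed struct clone_args. The layout is an
 * assumption copied from the recap in this thread. */
struct clone_args {
	uint64_t clone_flags_high;
	uint64_t child_stack;
	uint64_t child_stack_size;
	uint64_t parent_tid_ptr;
	uint64_t child_tid_ptr;
	uint32_t nr_pids;
	uint32_t reserved0;
};

/* Hypothetical helper: zero every field, then set only the ones in use. */
static void init_clone_args(struct clone_args *ca, void *stack_ptr)
{
	assert(ca != NULL);
	memset(ca, 0, sizeof(*ca));
	ca->child_stack = (uint64_t)(uintptr_t)stack_ptr;
	/* child_stack_size stays 0 where the architecture does not use it. */
}
```

Any field the caller does not touch after the memset() is then guaranteed to be 0, which is what the "unused fields must be 0" rule asks for.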
| original clone syscall was not designed with ia64 and hppa
| in mind, and has been causing trouble ever since. Let's not
| perpetuate the problem.
and a lot of folks contributed to this new API to try and make sure
it is portable and meets the foreseeable requirements.
|
| Given code like this: stack_base = malloc(stack_size);
| stack_base and stack_size are what the kernel needs.
|
| I suspect that you chose the defective method for some reason
| related to restarting processes that were created with the
| older system calls. I can't say most of us even care, but in
| that broken-already case your process restarter can make up
| some numbers that will work. (for i386, the base could be the
| lowest address in the vma in which %esp lies, or even address 0)
I don't understand how "making up some numbers (pids) that will work"
is more portable/cleaner than the proposed eclone().
Sukadev
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
2010-06-01 19:32 ` Sukadev Bhattiprolu
@ 2010-06-01 19:59 ` Albert Cahalan
2010-06-02 1:38 ` Sukadev Bhattiprolu
0 siblings, 1 reply; 14+ messages in thread
From: Albert Cahalan @ 2010-06-01 19:59 UTC (permalink / raw)
To: Sukadev Bhattiprolu; +Cc: linux-kernel, randy.dunlap, linuxppc-dev
On Tue, Jun 1, 2010 at 3:32 PM, Sukadev Bhattiprolu
<sukadev@linux.vnet.ibm.com> wrote:
> Albert Cahalan [acahalan@gmail.com] wrote:
> | Sukadev Bhattiprolu writes:
> | > Randy Dunlap [randy.dunlap at oracle.com] wrote:
> | >>> base of the region allocated for stack. These architectures
> | >>> must pass in the size of the stack-region in ->child_stack_size.
> | >>
> | >> stack region
> | >>
> | >> Seems unfortunate that different architectures use
> | >> the fields differently.
> | >
> | > Yes and no. The field still has a single purpose, just that
> | > some architectures may not need it. We enforce that if unused
> | > on an architecture, the field must be 0. It looked like
> | > the easiest way to keep the API common across architectures.
> |
> | Yuck. You're forcing userspace to have #ifdef messes or,
> | more likely, just not work on all architectures.
>
> There is going to be #ifdef code in the library interface to eclone().
> But applications should not need any #ifdefs. Please see the test cases
> for eclone in
>
> git://git.sr71.net/~hallyn/cr_tests.git
>
> There is no #ifdef and the tests work on x86, x86_64, ppc, s390.
Come on, seriously, you know it's ia64 and hppa that
have issues. Maybe the nommu ports also have issues.
The only portable way to specify the stack is base and offset,
with flags or magic values for "share" and "kernel managed".
> | There is no reason to have field usage vary by architecture. The
>
> The field usage does not vary by architecture. Some architectures
> don't use some fields and those fields must be 0.
It looks like you contradict yourself. Please explain how
those two sentences are compatible.
> | original clone syscall was not designed with ia64 and hppa
> | in mind, and has been causing trouble ever since. Let's not
> | perpetuate the problem.
>
> and a lot of folks contributed to this new API to try and make sure
> it is portable and meets the foreseeable requirements.
Right, and some folks were ignored.
> | Given code like this: stack_base = malloc(stack_size);
> | stack_base and stack_size are what the kernel needs.
> |
> | I suspect that you chose the defective method for some reason
> | related to restarting processes that were created with the
> | older system calls. I can't say most of us even care, but in
> | that broken-already case your process restarter can make up
> | some numbers that will work. (for i386, the base could be the
> | lowest address in the vma in which %esp lies, or even address 0)
>
> I don't understand how "making up some numbers (pids) that will work"
> is more portable/cleaner than the proposed eclone().
It isolates the cross-platform problems to an obscure tool
instead of polluting the kernel interface that everybody uses.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
2010-06-01 19:59 ` Albert Cahalan
@ 2010-06-02 1:38 ` Sukadev Bhattiprolu
2010-06-05 11:49 ` Albert Cahalan
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2010-06-02 1:38 UTC (permalink / raw)
To: Albert Cahalan; +Cc: linux-kernel, randy.dunlap, linuxppc-dev
| Come on, seriously, you know it's ia64 and hppa that
| have issues. Maybe the nommu ports also have issues.
|
| The only portable way to specify the stack is base and offset,
| with flags or magic values for "share" and "kernel managed".
Ah, ok, we have not yet ported to IA64 and I see now where the #ifdef
comes in.
But are you saying that we should force x86 and other architectures to
specify base and offset for eclone() even though they currently specify
just the stack pointer to clone() ?
That would remove the ifdef, but could be a big change to applications
on x86 and other architectures.
|
| > | There is no reason to have field usage vary by architecture. The
| >
| > The field usage does not vary by architecture. Some architectures
| > don't use some fields and those fields must be 0.
|
| It looks like you contradict yourself. Please explain how
| those two sentences are compatible.
|
| > | original clone syscall was not designed with ia64 and hppa
| > | in mind, and has been causing trouble ever since. Let's not
| > | perpetuate the problem.
| >
| > and a lot of folks contributed to this new API to try and make sure
| > it is portable and meets the foreseeable requirements.
|
| Right, and some folks were ignored.
I don't think your comment was ignored. The ->child_stack_size field was
added specifically for IA64 and my understanding was that ->clone_flags_high
could be used to specify the "kernel managed" or "shared" mode you mention
above.
| >
| > I don't understand how "making up some numbers (pids) that will work"
| > is more portable/cleaner than the proposed eclone().
|
| It isolates the cross-platform problems to an obscure tool
| instead of polluting the kernel interface that everybody uses.
Sure, there was talk about using an approach like /proc/<pid>/next_pid
where you write your target pid into the file and the next time you
fork() you get that target pid. But it was considered racy and ugly.
Sukadev
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
2010-06-02 1:38 ` Sukadev Bhattiprolu
@ 2010-06-05 11:49 ` Albert Cahalan
2010-06-05 11:58 ` Albert Cahalan
2010-06-05 12:08 ` Albert Cahalan
2 siblings, 0 replies; 14+ messages in thread
From: Albert Cahalan @ 2010-06-05 11:49 UTC (permalink / raw)
To: Sukadev Bhattiprolu; +Cc: linux-kernel, randy.dunlap, linuxppc-dev
On Tue, Jun 1, 2010 at 9:38 PM, Sukadev Bhattiprolu
<sukadev@linux.vnet.ibm.com> wrote:
> | Come on, seriously, you know it's ia64 and hppa that
> | have issues. Maybe the nommu ports also have issues.
> |
> | The only portable way to specify the stack is base and offset,
> | with flags or magic values for "share" and "kernel managed".
>
> Ah, ok, we have not yet ported to IA64 and I see now where the #ifdef
> comes in.
>
> But are you saying that we should force x86 and other architectures to
> specify base and offset for eclone() even though they currently specify
> just the stack pointer to clone() ?
Even for x86, it's an easier API. Callers would be specifying
two numbers they already have: the argument and return value
for malloc. Currently the numbers must be added together,
destroying information, except on hppa (must not add size)
and ia64 (must use what I'm proposing already).
This also provides the opportunity for the kernel (perhaps not
in the initial implementation) to have a bit of extra info about
some processes. The info could be supplied to gdb, used to
harden the system against some types of security exploits,
presented in /proc, and so on.
> That would remove the ifdef, but could be a big change to applications
> on x86 and other architectures.
It's no change at all until somebody decides to use the new
system call. At that point, you're making changes anyway.
It's certainly not a big change compared to eclone() itself.
> | > I don't understand how "making up some numbers (pids) that will work"
> | > is more portable/cleaner than the proposed eclone().
> |
> | It isolates the cross-platform problems to an obscure tool
> | instead of polluting the kernel interface that everybody uses.
>
> Sure, there was talk about using an approach like /proc/<pid>/next_pid
> where you write your target pid into the file and the next time you
> fork() you get that target pid. But it was considered racy and ugly.
Oh, you misunderstood what I meant by making up numbers
and I didn't catch it. I wasn't meaning PID numbers. I was meaning
stack numbers for processes that your strange tool is restarting.
You ignored my long-ago request to use base/size to specify
the stack. My guess was that this was because you're focused
on restarting processes, many of which will lack stack base info.
I thus suggested that you handle this obscure legacy case by
making up some reasonable numbers.
For example, suppose a process allocates 0x40000000 to
0x7fffffff (a 1 GiB chunk) and uses 0x50000000 to 0x5fffffff as
a thread stack. If done using the old clone() syscall on i386,
you're only told that 0x5fffffff is the last stack address. You
know nothing of 0x50000000. Your tool can see the size and
base of the whole mapping though, so 0x40000000...0x5fffffff
is a reasonable place to assume the stack lives. You therefore
call eclone with base=0x40000000 size=0x20000000 when
restarting the process.
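The arithmetic behind those made-up numbers can be sketched as follows. This is only an illustration: struct stack_region and guess_stack_region() are hypothetical names, and the sketch assumes the restart tool already knows the start of the mapping and the last stack address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the [base, size] pair handed to eclone(). */
struct stack_region {
	uint64_t base;
	uint64_t size;
};

/*
 * vma_start:       lowest address of the mapping that holds the stack
 * last_stack_addr: highest byte of the stack, which is all that the old
 *                  clone() on i386 reveals
 */
static struct stack_region guess_stack_region(uint64_t vma_start,
					      uint64_t last_stack_addr)
{
	struct stack_region r;

	assert(last_stack_addr >= vma_start);
	r.base = vma_start;
	/* Both end points are inclusive, hence the + 1. */
	r.size = last_stack_addr - vma_start + 1;
	return r;
}
```

For the addresses in the example, guess_stack_region(0x40000000, 0x5fffffff) yields base 0x40000000 and size 0x20000000 (512 MiB).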
For everybody NOT writing an obscure tool to restart processes,
my requested change eliminates #ifdef mess and/or needless
failure to support some architectures.
Right now user code must be like this:
base=malloc(size);
#if defined(__hppa__)
tid=clone(fn,base,flags,arg);
#elif defined(__ia64__)
tid=clone2(fn,base,size,flags,arg);
#else
tid=clone(fn,base+size,flags,arg);
#endif
The man page is likewise messy.
Note that if clone2 were available for all architectures,
we wouldn't have this mess. Let's not perpetuate the
mistakes that led to the mess. Please provide an API
that, like clone2, uses base and size. It'll work for every
architecture. It'll even be less trouble to document.
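To make the difference concrete, the value that the existing clone() wrapper expects can be derived from base and size by a small helper. The sketch below is hypothetical (legacy_clone_stack_arg() is not an existing function), ignores stack alignment requirements, and leaves out ia64 because clone2() there already takes base and size directly:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: turn a [base, size] pair into the single stack
 * argument that the legacy clone() wrapper wants on this architecture.
 */
static void *legacy_clone_stack_arg(void *base, size_t size)
{
	assert(base != NULL && size > 0);
#if defined(__hppa__)
	(void)size;
	return base;			/* stack grows up: start at the base */
#else
	return (char *)base + size;	/* stack grows down: start at the top */
#endif
}
```

With a base/size interface this translation would live in one place instead of being repeated by every caller.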
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
2010-06-05 12:08 ` Albert Cahalan
@ 2010-06-09 18:14 ` Sukadev Bhattiprolu
2010-06-09 18:46 ` H. Peter Anvin
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2010-06-09 18:14 UTC (permalink / raw)
To: Albert Cahalan
Cc: linux-kernel, randy.dunlap, linuxppc-dev, hpa, roland, arnd
Albert Cahalan [acahalan@gmail.com] wrote:
| On Tue, Jun 1, 2010 at 9:38 PM, Sukadev Bhattiprolu
| <sukadev@linux.vnet.ibm.com> wrote:
| > | Come on, seriously, you know it's ia64 and hppa that
| > | have issues. Maybe the nommu ports also have issues.
| > |
| > | The only portable way to specify the stack is base and offset,
| > | with flags or magic values for "share" and "kernel managed".
| >
| > Ah, ok, we have not yet ported to IA64 and I see now where the #ifdef
| > comes in.
| >
| > But are you saying that we should force x86 and other architectures to
| > specify base and offset for eclone() even though they currently specify
| > just the stack pointer to clone() ?
|
| Even for x86, it's an easier API. Callers would be specifying
| two numbers they already have: the argument and return value
| for malloc. Currently the numbers must be added together,
| destroying information, except on hppa (must not add size)
| and ia64 (must use what I'm proposing already).
I agree it's easier and would avoid #ifdefs in the applications.
Peter, Arnd, Roland - do you have any concerns with requiring all
architectures to specify the stack to eclone() as [base, offset]?
To recap, currently we have

	struct clone_args {
		u64 clone_flags_high;
		/*
		 * Architectures can use child_stack for either the stack pointer or
		 * the base of the stack. If child_stack is used as the stack pointer,
		 * child_stack_size must be 0. Otherwise child_stack_size must be
		 * set to the size of the allocated stack.
		 */
		u64 child_stack;
		u64 child_stack_size;
		u64 parent_tid_ptr;
		u64 child_tid_ptr;
		u32 nr_pids;
		u32 reserved0;
	};

	sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
			pid_t * __user pids)

Most architectures would specify the stack pointer in ->child_stack and
ignore the ->child_stack_size.
IA64 specifies the *stack-base* in ->child_stack and the stack size in
->child_stack_size.
Albert and Randy point out that this would require #ifdefs in the
application code that intends to be portable across say IA64 and x86.
Can we instead have all architectures specify [base, size] ?
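As a rough sketch of what [base, size] on every architecture could look like, assuming a userspace copy of the struct above: set_child_stack() and initial_stack_pointer() are hypothetical names, and the second function only illustrates how arch code might pick a starting stack pointer, not how the patch set does it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of struct clone_args (layout assumed from the recap). */
struct clone_args {
	uint64_t clone_flags_high;
	uint64_t child_stack;
	uint64_t child_stack_size;
	uint64_t parent_tid_ptr;
	uint64_t child_tid_ptr;
	uint32_t nr_pids;
	uint32_t reserved0;
};

/* Caller side: the same two numbers on every architecture. */
static void set_child_stack(struct clone_args *ca, void *base, size_t size)
{
	assert(ca != NULL && base != NULL && size > 0);
	ca->child_stack = (uint64_t)(uintptr_t)base;
	ca->child_stack_size = size;
}

/* Arch side (illustrative only): choose where the child's stack starts. */
static uint64_t initial_stack_pointer(const struct clone_args *ca,
				      int stack_grows_down)
{
	return stack_grows_down ? ca->child_stack + ca->child_stack_size
				: ca->child_stack;
}
```

The application never needs to know the growth direction; only the architecture-specific side does.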
Thanks
Sukadev
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
2010-06-09 18:14 ` Sukadev Bhattiprolu
@ 2010-06-09 18:46 ` H. Peter Anvin
2010-06-09 22:32 ` Roland McGrath
2010-06-10 9:15 ` Arnd Bergmann
2 siblings, 0 replies; 14+ messages in thread
From: H. Peter Anvin @ 2010-06-09 18:46 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: Albert Cahalan, linux-kernel, randy.dunlap, linuxppc-dev, roland,
arnd
On 06/09/2010 11:14 AM, Sukadev Bhattiprolu wrote:
> |
> | Even for x86, it's an easier API. Callers would be specifying
> | two numbers they already have: the argument and return value
> | for malloc. Currently the numbers must be added together,
> | destroying information, except on hppa (must not add size)
> | and ia64 (must use what I'm proposing already).
>
> I agree it's easier and would avoid #ifdefs in the applications.
>
> Peter, Arnd, Roland - do you have any concerns with requiring all
> architectures to specify the stack to eclone() as [base, offset]
>
Makes sense to me. There might be advantages to be able to track the
size of the "stack allocation" even for other architectures, too.
-hpa
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
2010-06-09 18:14 ` Sukadev Bhattiprolu
2010-06-09 18:46 ` H. Peter Anvin
@ 2010-06-09 22:32 ` Roland McGrath
2010-06-10 9:15 ` Arnd Bergmann
2 siblings, 0 replies; 14+ messages in thread
From: Roland McGrath @ 2010-06-09 22:32 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: Albert Cahalan, linux-kernel, randy.dunlap, linuxppc-dev, hpa,
arnd
> Peter, Arnd, Roland - do you have any concerns with requiring all
> architectures to specify the stack to eclone() as [base, offset]
I can't see why that would be a problem.
It's consistent with the sigaltstack interface we already have.
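For comparison, sigaltstack() already takes the lowest address and the size of the region, regardless of stack growth direction. A minimal sketch, where install_alt_stack() is a hypothetical helper and the caller is assumed to pass a size of at least MINSIGSTKSZ:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate and install an alternate signal stack.
 * Returns the base of the region on success, NULL on failure.
 */
static void *install_alt_stack(size_t size)
{
	stack_t ss;
	void *base;

	assert(size > 0);
	base = malloc(size);
	if (base == NULL)
		return NULL;

	ss.ss_sp = base;	/* lowest address of the region */
	ss.ss_size = size;	/* number of bytes in the region */
	ss.ss_flags = 0;
	if (sigaltstack(&ss, NULL) != 0) {
		free(base);
		return NULL;
	}
	return base;
}
```

The caller hands over exactly the argument and the return value of malloc(), which is the shape being requested for eclone().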
Thanks,
Roland
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
2010-06-09 18:14 ` Sukadev Bhattiprolu
2010-06-09 18:46 ` H. Peter Anvin
2010-06-09 22:32 ` Roland McGrath
@ 2010-06-10 9:15 ` Arnd Bergmann
2 siblings, 0 replies; 14+ messages in thread
From: Arnd Bergmann @ 2010-06-10 9:15 UTC (permalink / raw)
To: Sukadev Bhattiprolu
Cc: Albert Cahalan, linux-kernel, randy.dunlap, linuxppc-dev, hpa,
roland
On Wednesday 09 June 2010, Sukadev Bhattiprolu wrote:
> Albert and Randy point out that this would require #ifdefs in the
> application code that intends to be portable across say IA64 and x86.
>
> Can we instead have all architectures specify [base, size] ?
No objections from me on that.
Arnd
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v21 00/100] Kernel based checkpoint/restart
@ 2010-05-01 14:14 Oren Laadan
2010-05-01 14:14 ` [PATCH v21 011/100] eclone (11/11): Document sys_eclone Oren Laadan
0 siblings, 1 reply; 14+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC (permalink / raw)
To: Andrew Morton
Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
Pavel Emelyanov, Oren Laadan
Hi Andrew,
Here is the next version of the checkpoint/restart patchset. This
version moves portions of checkpoint code closer to where they belong.
As a convenience we've collected a rough table of contents showing
places to start for some reviewers with limited time and/or scope
(see below).
Thanks to Jamie, Nick, Andreas, and all who helped review the last few
versions, and thanks in advance for comments on this version.
We'll be very grateful if this can get a spin in -mm to get some wider
testing in the meantime.
Thanks,
The Checkpoint/Restart developers.
---
Linux Checkpoint-Restart:
web, wiki: http://www.linux-cr.org
bug track: https://www.linux-cr.org/redmine
The repositories for the project are in:
kernel: http://www.linux-cr.org/git/?p=linux-cr.git;a=summary
user tools: http://www.linux-cr.org/git/?p=user-cr.git;a=summary
tests suite: http://www.linux-cr.org/git/?p=tests-cr.git;a=summary
---
TABLE OF CONTENTS
Patches              Area/Role
-------------------------------------------------------------------------
11,20                Documentation (eclone, c/r)
8-11,21,22,27,28     Syscall gluey bits
12                   Arch Maintainers
8,22-24              x86-32/64
9,58,60              s390
10,84-88             powerpc
14,61-63,69,70,      Security
71,89-92
33,34,35             Generic c/r
                     (shared "object" hash, leak detection, deferqueues)
25,27-31             Processes
5-7                  fork (eclone)
39-41,45,46          memory
13,18,51,52,54,      namespaces
81-83,94
53-57                ipc
64-67                signals
1-4,70,83            pids, pgids, tids, tgids (eclone or pidns)
14,61,62,69          creds, capabilities, uids, gids
71                   sockets
76-78                terminals (specifically pty)
27,28,32             futexes (27,28 relate to futex syscalls restart)
39-41,45,46,55       mm (basically process memory)
15-17                Cgroups
71-75,93-99          Networking
19,36-38,42-44,      Filesystems (also pseudo-filesystems, anon_inodes)
47-50,63,76-77,
79-82
Some patches show up in multiple places because they are functionally
related even though they cross Area/Role boundaries. While we've done our
best to make the table above comprehensive, it's entirely conceivable that
we've neglected a small piece of a largely unrelated patch. Please feel
free to point these out to Matt Helsley <matthltc@us.ibm.com> since he's
largely responsible for this table.
---
CHANGELOG:
[2010-Apr-30] v21
- Add relevant maintainers/lists as Cc: in patch descriptions
- Reorganize code: move checkpoint/* to kernel/checkpoint/*
- Reorganize filesystem code into fs/*
- Merge files dump/restore into a single patch
- Merge mm dump/restore into a single patch
- Move utsns c/r code from checkpoint/namespace.c to kernel/utsname*.c
- [Matt Helsley] Move the signal c/r changes to kernel/signal.c
- Move userns c/r code from to kernel/{user,cred,user_namespace}.c
- Assorted fixes to bisectability of patchset
- Do not include checkpoint_hdr.h explicitly
- Subsystems/modules register shared objects types for c/r
- [Serge Hallyn] CONFIG_SECURITY_FILE_CAPABILITIES has been gone awhile
- [Dan Smith] Unbreak compiling with CONFIG_CHECKPOINT=n or CONFIG_NET_NS=n
- [Dan Smith] Clean up the error path in restore_veth()
- [Dan Smith] Fix acquiring socket lock before reading RTNETLINK response
- [Dan Smith] Skip down interfaces (v2)
- [Dan Smith] Export net checkpoint fns
- [Dan Smith] Add CHECKPOINT_NETNS flag
- [Dan Smith] Netdev restore function dispatching from a table
- [Dan Smith] Comment on controversial determination of "initial netns"
- [Dan Smith] Simplify the E2BIG error handling in netdev c/r
- [Dan Smith] Remove a redundant check for checkpoint support per-device
- [Nathan Lynch] powerpc: fix build break with CONFIG_CHECKPOINT=n
- [Matt Helsley] Eventfd: add missing spin locks around eventfd checkpoint
- [Matt Helsley] Put file_ops->checkpoint under CONFIG_CHECKPOINT
- [Dan Smith] Fix build when CONFIG_INET=n
- [Dan Smith] Disable softirqs when taking the socket queue lock
- Replace __initcall() with late_initcall()
- [Serge Hallyn] Remove [] following individual ops definitions.
- [Serge Hallyn] Fix compilation for when CONFIG_USER_NS=y
- [Serge Hallyn] handle CONFIG_{SYSVIPC,SYSVIPC,POSIX_MQUEUE}=n
- [Serge Hallyn] Remove namespace.o from kernel/checkpoint/Makefile
- [Stanislav O. Bezzubtsev] Fix omitted parameter name error
- Put file_ops->checkpoint under CONFIG_CHECKPOINT
- [Serge] Print out full path of file which crossed mnt_ns
- Update Documentation/filesystems/vfs.txt
- Restore_obj() to tolerate a preexisting object in the hash
- Add ckpt_obj_del() to objhash for handling error conditions
- [Serge Hallyn] Replace BUG_ON() in obj_new with error returns
- [Matt Helsley] Move CKPT_CTX_ERROR* definitions to first use.
- [Nathan Lynch] x86: use task_user_gs to checkpoint gs
- Complain if checkpoint_hdr.h included without CONFIG_CHECKPOINT
- Introduce kernel_write(), fix kernel_read()
- Consolidate ckpt_read/write with kernel_read/write
- [Christoffer Dall] Fix trivial bug in ckpt_msg macro
- [Serge Hallyn] user/group: address dhowells feedback
[2010-Mar-16] v20
BUG FIXES (only)
- [Serge Hallyn] Fix unlabeled restore case
- [Serge Hallyn] Always restore msg_msg label
- [Serge Hallyn] Selinux prevents msgrcv on restore message queues?
- [Serge Hallyn] save_access_regs for self-checkpoint
- [Serge Hallyn] send uses_interp=1 to arch_setup_additional_pages
- Fix "scheduling in atomic" while restoring ipc (sem, shm, msg)
- Cleanup: no need to restore perm->{id,key,seq}
- Fix sysvipc=n compile
- Make uts_ns=n compile
- Only use arch_setup_additional_pages() if supported by arch
- Export key symbols to enable c/r from kernel modules
- Avoid crash if incoming object doesn't have .restore
- Replace error_sem with an event completion
- [Serge Hallyn] Change sysctl and default for unprivileged use
- [Nathan Lynch] Use syscall_get_error
- Add entry for checkpoint/restart in MAINTAINERS
[2010-Feb-19] v19
NEW FEATURES
- Support for x86-64 architecture
- Support for c/r of LSM (smack, selinux)
- Support for c/r of task fs_root and pwd
- Support for c/r of epoll
- Support for c/r of eventfd
- Enable C/R while executing over NFS
- Preliminary c/r of mounts namespace
- Add @logfd argument to sys_{checkpoint,restart} prototypes
- Define new api for error and debug logging
- Restart to handle checkpoint images lacking {uts,ipc}-ns
- Refuse to checkpoint if monitoring directories with dnotify
- Refuse to checkpoint if file locks and leases are held
- Refuse to checkpoint files with f_owner
OTHER CHANGES
- Rebase to kernel 2.6.33-rc8
- Settled version of new sys_eclone()
- [Serge Hallyn] Fix potential use-before-set return (vdso)
- Update documentation and examples for new syscalls API (doc)
- [Liu Alexander] Fix typos (doc)
- [Serge Hallyn] Update checkpoint image format (doc)
- [Serge Hallyn] Use ckpt_err() for bad header values
- sys_{checkpoint,restart} to use ptregs prototype
- Set ctx->errno in do_ckpt_msg() if needed
- Fix up headers so we can munge them for use by userspace
- Multiple fixes to _ckpt_write_err() and friends
- [Matt Helsley] Add cpp definitions for enums
- [Serge Hallyn] Add global section container to image format
- [Matt Helsley] Fix total byte read/write count for large images
- ckpt_read_buf_type() to accept max payload (excludes ckpt_hdr)
- [Serge Hallyn] Use ckpt_err() for arch incompatibilities
- Introduce walk_task_subtree() to iterate through descendants
- Call restore_notify_error for restart (not checkpoint !)
- Make kread/kwrite() abort if CKPT_CTX_ERROR is set
- [Serge Hallyn] Move init_completion(&ctx->complete) to ctx_alloc
- Simplify logic of tracking restarting tasks (->ctx)
- Coordinator kills descendants on failure for proper cleanup
- Prepare descendants needs PTRACE_MODE_ATTACH permissions
- Threads wait for entire thread group before restoring
- Add debug process-tree status during restart
- Fix handling of bogus pid arg to sys_restart
- In reparent_thread() test for PF_RESTARTING on parent
- Keep __u32s in even groups for 32-64 bit compatibility
- Define ckpt_obj_try_fetch
- Disallow zero or negative objref during restart
- Check for valid destructor before calling it (deferqueue)
- Fix false negative of test for unlinked files at checkpoint
- [Serge Hallyn] Rename fs_mnt to root_fs_path
- Restore thread/cpu state early
- Ensure null-termination of file names read from image
- Fix compile warning in restore_open_fname()
- Introduce FOLL_DIRTY to follow_page() for "dirty" pages
- [Serge Hallyn] Checkpoint saved_auxv as u64s
- Export filemap_checkpoint()
- [Serge Hallyn] Disallow checkpoint of tasks with aio requests
- Fix compilation failure when !CONFIG_CHECKPOINT (regression)
- Expose page write functions
- Do not hold mmap_sem while checkpointing vma's
- Do not hold mmap_sem when reading memory pages on restart
- Move consider_private_page() to mm/memory.c:__get_dirty_page()
- [Serge Hallyn] move destroy_mm into mmap.c and remove size check
- [Serge Hallyn] fill vdso (syscall32_setup_pages) for TIF_IA32/x86_64
- [Serge Hallyn] Fix return value of read_pages_contents()
- [Serge Hallyn] Change m_type to long, not int (ipc)
- Don't free sma if it's an error on restore
- Use task->saved_sigmask and drop task->checkpoint_data
- [Serge Hallyn] Handle saved_sigmask at checkpoint
- Defer restore of blocked signals mask during restart
- Self-restart to tolerate missing PGIDs
- [Serge Hallyn] skb->tail can be offset
- Export and leverage sock_alloc_file()
- [Nathan Lynch] Fix net/checkpoint.c for 64-bit
- [Dan Smith] Unify skb read/write functions and handle fragmented buffers
- [Dan Smith] Update buffer restore code to match the new format
- [Dan Smith] Fix compile issue with CONFIG_CHECKPOINT=n
- [Dan Smith] Remove an unnecessary check on socket restart
- [Dan Smith] Pass the stored sock->protocol into sock_create() on restore
- Relax tcp.window_clamp value in INET restore
- Restore gso_type fields on sockets and buffers for proper operation
- Fix broken compilation for no-c/r architectures
- Return -EBUSY (not BUG_ON) if fd is gone on restart
- Fix the chunk size instead of auto-tune (epoll)
ARCH: x86 (32,64)
- Use PTREGSCALL4 for sys_{checkpoint,restart}
- Remove debug-reg support (need to redo with perf_events)
- [Serge Hallyn] Support for ia32 (checkpoint, restart)
- Split arch/x86/checkpoint.c to generic and 32bit specific parts
- sys_{checkpoint,restart} to use ptregs
- Allow X86_EFLAGS_RF on restart
- [Serge Hallyn] Only allow 'restart' with same bit-ness as image.
- Move checkpoint.c from arch/x86/mm->arch/x86/kernel
ARCH: s390 [Serge Hallyn]
- Define s390x sys_restart wrapper
- Fixes to restart-blocks logic and signal path
- Fix checkpoint and restart compat wrappers
- sys_{checkpoint,restart} to use ptregs
- Use simpler test_task_thread to test current ti flags
- Fix 31-bit s390 checkpoint/restart wrappers
- Update sys_checkpoint (do_sys_checkpoint on all archs)
- [Oren Laadan] Move checkpoint.c from arch/s390/mm->arch/s390/kernel
ARCH: powerpc [Nathan Lynch]
- [Serge Hallyn] Add hook task_has_saved_sigmask()
- Warn if full register state unavailable
- Fix up checkpoint syscall, tidy restart
- [Oren Laadan] Move checkpoint.c from arch/powerpc/{mm->kernel}
[2009-Sep-22] v18
NEW FEATURES
- [Nathan Lynch] Re-introduce powerpc support
- Save/restore pseudo-terminals
- Save/restore (pty) controlling terminals
- Save/restore PGIDs
- [Dan Smith] Save/restore unix domain sockets
- Save/restore FIFOs
- Save/restore pending signals
- Save/restore rlimits
- Save/restore itimers
- [Matt Helsley] Handle many non-pseudo file-systems
OTHER CHANGES
- Rename headerless struct ckpt_hdr_* to struct ckpt_*
- [Nathan Lynch] discard const from struct cred * where appropriate
- [Serge Hallyn][s390] Set return value for self-checkpoint
- Handle kmalloc failure in restore_sem_array()
- [IPC] Collect files used by shm objects
- [IPC] Use file (not inode) as shared object on checkpoint of shm
- More ckpt_write_err()s to give information on checkpoint failure
- Adjust format of pipe buffer to include the mandatory pre-header
- [LEAKS] Mark the backing file as visited at checkpoint
- Tighten checks on supported vma to checkpoint or restart
- [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
- Introduce ckpt_collect_file() that also uses file->collect method
- Use ckpt_collect_file() instead of ckpt_obj_collect() for files
- Fix leak-detection issue in collect_mm() (test for first-time obj)
- Invoke set_close_on_exec() unconditionally on restart
- [Dan Smith] Export fill_fname() as ckpt_fill_fname()
- Interface to pass simple pointers as data with deferqueue
- [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
- Replace EAGAIN with EBUSY where necessary
- Introduce CKPT_OBJ_VISITED in leak detection
- ckpt_obj_collect() returns objref for new objects, 0 otherwise
- Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
- Introduce ckpt_obj_visit() to mark objects as visited
- Set the CHECKPOINTED flag on objects before calling checkpoint
- Introduce ckpt_obj_reserve()
- Change ref_drop() to accept a @lastref argument (for cleanup)
- Disallow multiple objects with same objref in restart
- Allow _ckpt_read_obj_type() to read header only (w/o payload)
- Fix leak of ckpt_ctx when restoring zombie tasks
- Fix race of prepare_descendant() with an ongoing fork()
- Track and report the first error if restart fails
- Tighten logic to protect against bogus pids in input
- [Matt Helsley] Improve debug output from ckpt_notify_error()
- [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
- Detect error-headers in input data on restart, and abort.
- Standard format for checkpoint error strings (and documentation)
- [Dan Smith] Add an errno validation function
- Add ckpt_read_payload(): read a variable-length object (no header)
- Add ckpt_read_string(): same for strings (ensures null-terminated)
- Add ckpt_read_consume(): consumes next object without processing
- [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
[2009-Jul-21] v17
- Introduce syscall clone_with_pids() to restore original pids
- Support threads and zombies
- Save/restore task->files
- Save/restore task->sighand
- Save/restore futex
- Save/restore credentials
- Introduce PF_RESTARTING to skip notifications on task exit
- restart(2) allow caller to ask to freeze tasks after restart
- restart(2) isn't idempotent: return -EINTR if interrupted
- Improve debugging output handling
- Make multi-process restart logic more robust and complete
- Correctly select return value for restarting tasks on success
- Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
- Use CHECKPOINTING state for frozen checkpointed tasks
- Fix compilation without CONFIG_CHECKPOINT
- Fix compilation with CONFIG_COMPAT
- Fix headers includes and exports
- Leak detection performed in two steps
- Detect "inverse" leaks of objects (dis)appearing unexpectedly
- Memory: save/restore mm->{flags,def_flags,saved_auxv}
- Memory: only collect sub-objects of mm once (leak detection)
- Files: validate f_mode after restore
- Namespaces: leak detection for nsproxy sub-components
- Namespaces: proper restart from namespace(s) without namespace(s)
- Save global constants in header instead of per-object
- IPC: replace sys_unshare() with create_ipc_ns()
- IPC: restore objects in suitable namespace
- IPC: correct behavior under !CONFIG_IPC_NS
- UTS: save/restore all fields
- UTS: replace sys_unshare() with create_uts_ns()
- X86_32: sanitize cpu, debug, and segment registers on restart
- cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
- cgroup_freezer: add interface to freeze a cgroup (given a task)
[2009-May-27] v16
- Privilege checks for IPC checkpoint
- Fix error string generation during checkpoint
- Use kzalloc for header allocation
- Restart blocks are arch-independent
- Redo pipe c/r using splice
- Fixes to s390 arch
- Remove powerpc arch (temporary)
- Explicitly restore ->nsproxy
- All objects in image are preceded by 'struct ckpt_hdr'
- Fix leaks detection (and leaks)
- Reorder of patchset
- Misc bugs and compilation fixes
[2009-Apr-12] v15
- Minor fixes
[2009-Apr-28] v14
- Tested against kernel v2.6.30-rc3 on x86_32.
- Refactor files checkpoint to use f_ops (file operations)
- Refactor mm/vma to use vma_ops
- Explicitly handle VDSO vma (and require compat mode)
- Added code to c/r restart-blocks (restart timeout related syscalls)
- Added code to c/r namespaces: uts, ipc (with Dan Smith)
- Added code to c/r sysvipc (shm, msg, sem)
- Support for VM_CLONE shared memory
- Added resource leak detection for whole-container checkpoint
- Added sysctl gauge to allow unprivileged restart/checkpoint
- Improve and simplify the code and logic of shared objects
- Rework image format: shared objects appear prior to their use
- Merge checkpoint and restart functionality into same files
- Massive renaming of functions: prefix "ckpt_" for generics,
"checkpoint_" for checkpoint, and "restore_" for restart.
- Report checkpoint errors as a valid (string record) in the output
- Merged PPC architecture (by Nathan Lynch)
- Requires updates to userspace tools too.
- Misc nits and bug fixes
[2009-Mar-31] v14-rc2
- Change along Dave's suggestion to use f_ops->checkpoint() for files
- Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
- Merge support for PPC arch (Nathan Lynch)
- Misc cleanups and fixes in response to comments
[2009-Mar-20] v14-rc1:
- The 'h.parent' field of 'struct cr_hdr' isn't used - discard
- Check whether calls to cr_hbuf_get() succeed or fail.
- Fixes to pipe c/r code
- Prevent deadlock by refusing c/r when a pipe inode == ctx->file inode
- Refuse non-self checkpoint if a task isn't frozen
- Use unsigned fields in checkpoint headers unless otherwise required
- Rename functions in files c/r to better reflect their role
- Add support for anonymous shared memory
- Merge support for s390 arch (Dan Smith, Serge Hallyn)
[2008-Dec-03] v13:
- Cleanups of 'struct cr_ctx' - remove unused fields
- Misc fixes for comments
[2008-Dec-17] v12:
- Fix re-alloc/reset of pgarr chain to correctly reuse buffers
(empty pgarr are saved in a separate pool chain)
- Add a couple of missed calls to cr_hbuf_put()
- cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
- Split cr_write/cr_read() to two parts: _cr_write/read() helper
- Befriend with sparse: explicit conversion to 'void __user *'
- Redefine 'pr_fmt' and replace cr_debug() with pr_debug()
[2008-Dec-05] v11:
- Use contents of 'init->fs->root' instead of pointing to it
- Ignore symlinks (there is no such thing as an open symlink)
- cr_scan_fds() retries from scratch if it hits size limits
- Add missing test for VM_MAYSHARE when dumping memory
- Improve documentation about: behavior when tasks aren't frozen,
life span of the object hash, references to objects in the hash
[2008-Nov-26] v10:
- Grab vfs root of container init, rather than current process
- Acquire dcache_lock around call to __d_path() in cr_fill_name()
- Force end-of-string in cr_read_string() (fix possible DoS)
- Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
[2008-Nov-10] v9:
- Support multiple processes c/r
- Extend checkpoint header with architecture dependent header
- Misc bug fixes (see individual changelogs)
- Rebase to v2.6.28-rc3.
[2008-Oct-29] v8:
- Support "external" checkpoint
- Include Dave Hansen's 'deny-checkpoint' patch
- Split docs in Documentation/checkpoint/..., and improve contents
[2008-Oct-17] v7:
- Fix save/restore state of FPU
- Fix argument given to kunmap_atomic() in memory dump/restore
[2008-Oct-07] v6:
- Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
(even though it's not really needed)
- Add assumptions and what's-missing to documentation
- Misc fixes and cleanups
[2008-Sep-11] v5:
- Config is now 'def_bool n' by default
- Improve memory dump/restore code (following Dave Hansen's comments)
- Change dump format (and code) to allow chunks of <vaddrs, pages>
instead of one long list of each
- Fix use of follow_page() to avoid faulting in non-present pages
- Memory restore now maps user pages explicitly to copy data into them,
instead of reading directly to user space; got rid of mprotect_fixup()
- Remove preempt_disable() when restoring debug registers
- Rename headers files s/ckpt/checkpoint/
- Fix misc bugs in files dump/restore
- Fixes and cleanups on some error paths
- Fix misc coding style
[2008-Sep-09] v4:
- Various fixes and clean-ups
- Fix calculation of hash table size
- Fix header structure alignment
- Use standard list_... for cr_pgarr
[2008-Aug-29] v3:
- Various fixes and clean-ups
- Use standard hlist_... for hash table
- Better use of standard kmalloc/kfree
[2008-Aug-20] v2:
- Added Dump and restore of open files (regular and directories)
- Added basic handling of shared objects, and improve handling of
'parent tag' concept
- Added documentation
- Improved ABI, 64bit padding for image data
- Improved locking when saving/restoring memory
- Added UTS information to header (release, version, machine)
- Cleanup extraction of filename from a file pointer
- Refactor to allow easier reviewing
- Remove requirement for CAP_SYS_ADMIN until we come up with a
security policy (this means that file restore may fail)
- Other cleanup and response to comments for v1
[2008-Jul-29] v1:
- Initial version: support a single task with address space of only
private anonymous or file-mapped VMAs; syscalls ignore pid/crid
argument and act on current process.
* [PATCH v21 011/100] eclone (11/11): Document sys_eclone
2010-05-01 14:14 [PATCH v21 00/100] Kernel based checkpoint/restart Oren Laadan
@ 2010-05-01 14:14 ` Oren Laadan
2010-05-05 21:14 ` Randy Dunlap
0 siblings, 1 reply; 14+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC (permalink / raw)
To: Andrew Morton
Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
Pavel Emelyanov, Sukadev Bhattiprolu, linux-api, x86, linux-s390,
linuxppc-dev
From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
This gives a brief overview of the eclone() system call. We should
eventually describe more details in existing clone(2) man page or in
a new man page.
Changelog[v13]:
- [Nathan Lynch, Serge Hallyn] Rename ->child_stack_base to
->child_stack and ensure ->child_stack_size is 0 on architectures
that don't need it.
- [Arnd Bergmann] Remove ->reserved1 field
- [Louis Rilling, Dave Hansen] Combine the two asm statements in the
example into one and use memory constraint to avoid unnecessary copies.
Changelog[v12]:
- [Serge Hallyn] Fix/simplify stack-setup in the example code
- [Serge Hallyn, Oren Laadan] Rename syscall to eclone()
Changelog[v11]:
- [Dave Hansen] Move clone_args validation checks to arch-independent
code.
- [Oren Laadan] Make args_size a parameter to system call and remove
it from 'struct clone_args'
- [Oren Laadan] Fix some typos and clarify the order of pids in the
@pids parameter.
Changelog[v10]:
- Rename clone3() to clone_with_pids() and fix some typos.
- Modify example to show usage with the ptregs implementation.
Changelog[v9]:
- [Pavel Machek]: Fix an inconsistency and rename new file to
Documentation/clone3.
- [Roland McGrath, H. Peter Anvin] Updates to description and
example to reflect new prototype of clone3() and the updated/
renamed 'struct clone_args'.
Changelog[v8]:
- clone2() is already in use in IA64. Rename syscall to clone3()
- Add notes to say that we return -EINVAL if invalid clone flags
are specified or if the reserved fields are not 0.
Changelog[v7]:
- Rename clone_with_pids() to clone2()
- Changes to reflect new prototype of clone2() (using clone_struct).
Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
Documentation/eclone | 348 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 348 insertions(+), 0 deletions(-)
create mode 100644 Documentation/eclone
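Note for readers: the EINVAL conditions listed in the new Documentation/eclone file can be sketched as a user-space pre-check. This is only an illustration under stated assumptions, not code from the patch: check_clone_args(), its ns_depth argument (number of pid namespaces the caller is nested in, counting init_pid_ns) and its arch_needs_stack_size argument are hypothetical names, and the kernel's actual validation may differ in order and detail.

```c
#include <errno.h>
#include <stdint.h>

/* Local copy of the structure as documented in this patch. */
struct clone_args {
	uint64_t clone_flags_high;
	uint64_t child_stack;
	uint64_t child_stack_size;
	uint64_t parent_tid_ptr;
	uint64_t child_tid_ptr;
	uint32_t nr_pids;
	uint32_t reserved0;
};

/*
 * Hypothetical pre-check (not a kernel or libc function).
 *
 * ns_depth: how many pid namespaces the caller is nested in,
 *           counting init_pid_ns (so 1 for a task in init_pid_ns).
 * arch_needs_stack_size: nonzero on architectures, such as IA64,
 *           that take a stack base plus a size.
 *
 * Returns 0 if the arguments satisfy the documented rules,
 * -EINVAL otherwise.
 */
int check_clone_args(const struct clone_args *ca, int args_size,
		     unsigned int ns_depth, int arch_needs_stack_size)
{
	/* The size must match the current structure exactly. */
	if (args_size != (int)sizeof(struct clone_args))
		return -EINVAL;

	/* No high flags are defined yet; the reserved field must be 0. */
	if (ca->clone_flags_high != 0 || ca->reserved0 != 0)
		return -EINVAL;

	/* Stack-pointer architectures require a zero stack size. */
	if (!arch_needs_stack_size && ca->child_stack_size != 0)
		return -EINVAL;

	/* Cannot request pids in more namespaces than the caller has. */
	if (ca->nr_pids > ns_depth)
		return -EINVAL;

	return 0;
}
```

A caller could run such a check before invoking eclone() to get an early, local diagnosis; the kernel remains the authority on what is accepted.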
diff --git a/Documentation/eclone b/Documentation/eclone
new file mode 100644
index 0000000..c2f1b4b
--- /dev/null
+++ b/Documentation/eclone
@@ -0,0 +1,348 @@
+
+struct clone_args {
+ u64 clone_flags_high;
+ u64 child_stack;
+ u64 child_stack_size;
+ u64 parent_tid_ptr;
+ u64 child_tid_ptr;
+ u32 nr_pids;
+ u32 reserved0;
+};
+
+
+sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
+ pid_t * __user pids)
+
+ In addition to doing everything that clone() system call does, the
+ eclone() system call:
+
+ - allows additional clone flags (31 of 32 bits in the flags
+ parameter to clone() are in use)
+
+ - allows user to specify a pid for the child process in its
+ active and ancestor pid namespaces.
+
+ This system call is meant to be used when restarting an application
+ from a checkpoint. Such restart requires that the processes in the
+ application have the same pids they had when the application was
+ checkpointed. When containers are nested, the processes within the
+ containers exist in multiple pid namespaces and hence have multiple
+ pids to specify during restart.
+
+ The @flags_low parameter is identical to the 'clone_flags' parameter
+ in existing clone() system call.
+
+ The fields in 'struct clone_args' are meant to be used as follows:
+
+ u64 clone_flags_high:
+
+ When eclone() supports more than 32 flags, the additional bits
+ in the clone_flags should be specified in this field. This
+ field is currently unused and must be set to 0.
+
+ u64 child_stack;
+ u64 child_stack_size;
+
+ These two fields correspond to the 'child_stack' fields in
+ clone() and clone2() (on IA64) system calls. The usage of
+ these two fields depends on the processor architecture.
+
+ Most architectures use ->child_stack to pass-in a stack-pointer
+ itself and don't need the ->child_stack_size field. On these
+ architectures the ->child_stack_size field must be 0.
+
+ Some architectures, eg IA64, use ->child_stack to pass-in the
+ base of the region allocated for stack. These architectures
+ must pass in the size of the stack-region in ->child_stack_size.
+
+ u64 parent_tid_ptr;
+ u64 child_tid_ptr;
+
+ These two fields correspond to the 'parent_tid_ptr' and
+ 'child_tid_ptr' fields in the clone() system call
+
+ u32 nr_pids;
+
+ nr_pids specifies the number of pids in the @pids array
+ parameter to eclone() (see below). nr_pids should not exceed
+ the current nesting level of the calling process (i.e., if the
+ process is in init_pid_ns, nr_pids must be 1, if process is
+ in a pid namespace that is a child of init-pid-ns, nr_pids
+ cannot exceed 2, and so on).
+
+ u32 reserved0;
+
+ This field is reserved so that the functionality of eclone()
+ can be extended in the future while preserving backward
+ compatibility with existing callers. It must be set to 0 for
+ now.
+
+ The @cargs_size parameter specifies the sizeof(struct clone_args) and
+ is intended to enable extending this structure in the future, while
+ preserving backward compatibility. For now, this field must be set
+ to the sizeof(struct clone_args) and this size must match the kernel's
+ view of the structure.
+
+ The @pids parameter defines the set of pids that should be assigned to
+ the child process in its active and ancestor pid namespaces. The
+ descendant pid namespaces do not matter since a process does not have a
+ pid in descendant namespaces, unless the process is in a new pid
+ namespace in which case the process is a container-init (and must have
+ the pid 1 in that namespace).
+
+ See CLONE_NEWPID section of clone(2) man page for details about pid
+ namespaces.
+
+ If a pid in the @pids list is 0, the kernel will assign the next
+ available pid in the pid namespace.
+
+ If a pid in the @pids list is non-zero, the kernel tries to assign
+ the specified pid in that namespace. If that pid is already in use
+ by another process, the system call fails (see EBUSY below).
+
+ The order of pids in @pids is oldest in pids[0] to youngest pid
+ namespace in pids[nr_pids-1]. If the number of pids specified in the
+ @pids list is fewer than the nesting level of the process, the pids
+ are applied from the youngest namespace, i.e., if the process is nested in
+ a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
+ are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
+ have a pid of '0' (the kernel will assign a pid in those namespaces).
+
+ On success, the system call returns the pid of the child process in
+ the parent's active pid namespace.
+
+ On failure, eclone() returns -1 and sets 'errno' to one of the following
+ values (the child process is not created).
+
+ EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to
+ specify the pids in this call (if pids are not specified
+ CAP_SYS_ADMIN is not required).
+
+ EINVAL The number of pids specified in 'clone_args.nr_pids' exceeds
+ the current nesting level of parent process
+
+ EINVAL Not all specified clone-flags are valid.
+
+ EINVAL The reserved fields in the clone_args argument are not 0.
+
+ EINVAL The child_stack_size field is not 0 (on architectures that
+ pass in a stack pointer in ->child_stack field)
+
+ EBUSY A requested pid is in use by another process in that namespace.
+
+---
+/*
+ * Example eclone() usage - Create a child process with pid CHILD_TID1 in
+ * the current pid namespace. The child gets the usual "random" pid in any
+ * ancestor pid namespaces.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <errno.h>
+#include <unistd.h>
+#include <wait.h>
+#include <sys/syscall.h>
+
+#define __NR_eclone 337
+#define CLONE_NEWPID 0x20000000
+#define CLONE_CHILD_SETTID 0x01000000
+#define CLONE_PARENT_SETTID 0x00100000
+#define CLONE_UNUSED 0x00001000
+
+#define STACKSIZE 8192
+
+typedef unsigned long long u64;
+typedef unsigned int u32;
+typedef int pid_t;
+struct clone_args {
+ u64 clone_flags_high;
+ u64 child_stack;
+ u64 child_stack_size;
+
+ u64 parent_tid_ptr;
+ u64 child_tid_ptr;
+
+ u32 nr_pids;
+
+ u32 reserved0;
+};
+
+#define exit _exit
+
+/*
+ * Following eclone() is based on code posted by Oren Laadan at:
+ * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
+ */
+#if defined(__i386__) && defined(__NR_eclone)
+
+int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
+ int *pids)
+{
+ long retval;
+
+ __asm__ __volatile__(
+ "movl %3, %%ebx\n\t" /* flags_low -> 1st (ebx) */
+ "movl %4, %%ecx\n\t" /* clone_args -> 2nd (ecx)*/
+ "movl %5, %%edx\n\t" /* args_size -> 3rd (edx) */
+ "movl %6, %%edi\n\t" /* pids -> 4th (edi)*/
+
+ "pushl %%ebp\n\t" /* save value of ebp */
+ "int $0x80\n\t" /* Linux/i386 system call */
+ "testl %0,%0\n\t" /* check return value */
+ "jne 1f\n\t" /* jump if parent */
+
+ "popl %%esi\n\t" /* get subthread function */
+ "call *%%esi\n\t" /* start subthread function */
+ "movl %2,%0\n\t"
+ "int $0x80\n" /* exit system call: exit subthread */
+ "1:\n\t"
+ "popl %%ebp\t" /* restore parent's ebp */
+
+ :"=a" (retval)
+
+ :"0" (__NR_eclone),
+ "i" (__NR_exit),
+ "m" (flags_low),
+ "m" (clone_args),
+ "m" (args_size),
+ "m" (pids)
+ );
+
+ if (retval < 0) {
+ errno = -retval;
+ retval = -1;
+ }
+ return retval;
+}
+
+/*
+ * Allocate a stack for the clone-child and arrange to have the child
+ * execute @child_fn with @child_arg as the argument.
+ */
+void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
+{
+ void *stack_base;
+ void **stack_top;
+
+ stack_base = malloc(size + size);
+ if (!stack_base) {
+ perror("malloc()");
+ exit(1);
+ }
+
+ stack_top = (void **)((char *)stack_base + (size - 4));
+ *--stack_top = child_arg;
+ *--stack_top = child_fn;
+
+ return stack_top;
+}
+#endif
+
+/* gettid() is a bit more useful than getpid() when messing with clone() */
+int gettid()
+{
+ int rc;
+
+ rc = syscall(__NR_gettid, 0, 0, 0);
+ if (rc < 0) {
+ printf("rc %d, errno %d\n", rc, errno);
+ exit(1);
+ }
+ return rc;
+}
+
+#define CHILD_TID1 377
+#define CHILD_TID2 1177
+#define CHILD_TID3 2799
+
+struct clone_args clone_args;
+void *child_arg = &clone_args;
+int child_tid;
+
+int do_child(void *arg)
+{
+ struct clone_args *cs = (struct clone_args *)arg;
+ int ctid;
+
+ /* Verify we pushed the arguments correctly on the stack... */
+ if (arg != child_arg) {
+ printf("Child: Incorrect child arg pointer, expected %p, "
+ "actual %p\n", child_arg, arg);
+ exit(1);
+ }
+
+ /* ... and that we got the thread-id we expected */
+ ctid = *((int *)(unsigned long)cs->child_tid_ptr);
+ if (ctid != CHILD_TID1) {
+ printf("Child: Incorrect child tid, expected %d, actual %d\n",
+ CHILD_TID1, ctid);
+ exit(1);
+ } else {
+ printf("Child got the expected tid, %d\n", gettid());
+ }
+ sleep(2);
+
+ printf("[%d, %d]: Child exiting\n", getpid(), ctid);
+ exit(0);
+}
+
+static int do_clone(int (*child_fn)(void *), void *child_arg,
+ unsigned int flags_low, int nr_pids, pid_t *pids_list)
+{
+ int rc;
+ void *stack;
+ struct clone_args *ca = &clone_args;
+ int args_size;
+
+ stack = setup_stack(child_fn, child_arg, STACKSIZE);
+
+ memset(ca, 0, sizeof(*ca));
+
+ ca->child_stack = (u64)(unsigned long)stack;
+ ca->child_stack_size = (u64)0;
+ ca->child_tid_ptr = (u64)(unsigned long)&child_tid;
+ ca->nr_pids = nr_pids;
+
+ args_size = sizeof(struct clone_args);
+ rc = eclone(flags_low, ca, args_size, pids_list);
+
+ printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
+ rc, errno);
+ return rc;
+}
+
+/*
+ * Multiple pid_t values in pids_list[] here are just for illustration.
+ * The test case creates a child in the current pid namespace and uses only
+ * the first value, CHILD_TID1.
+ */
+pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
+int main()
+{
+ int rc, pid, status;
+ unsigned long flags;
+ int nr_pids = 1;
+
+ flags = SIGCHLD|CLONE_CHILD_SETTID;
+
+ pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
+
+ printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
+
+ rc = waitpid(pid, &status, __WALL);
+ if (rc < 0) {
+ printf("waitpid(): rc %d, error %d\n", rc, errno);
+ } else {
+ printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
+ gettid(), rc, status);
+
+ if (WIFEXITED(status)) {
+ printf("\t EXITED, %d\n", WEXITSTATUS(status));
+ } else if (WIFSIGNALED(status)) {
+ printf("\t SIGNALED, %d\n", WTERMSIG(status));
+ }
+ }
+ return 0;
+}
--
1.6.3.3
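The ordering rule for @pids described in the documentation above (oldest namespace first, with a short list applied to the youngest namespaces) can be sketched as follows. This is a minimal illustration, assuming the rule works exactly as the text states; expand_pids() and its parameters are hypothetical and do not appear in the patch.

```c
#include <sys/types.h>

/*
 * Hypothetical helper (not part of the patch): expand a @pids list
 * into one slot per pid-namespace level.
 *
 * level:     nesting level of the child's active pid namespace,
 *            with 0 meaning init_pid_ns, so level + 1 slots exist.
 * pids:      requested pids, oldest namespace first.
 * nr_pids:   number of entries in pids.
 * per_level: output array with at least level + 1 entries; a value
 *            of 0 means "let the kernel pick a pid at this level".
 *
 * Returns 0 on success, -1 if nr_pids does not fit the nesting level.
 */
int expand_pids(int level, const pid_t *pids, int nr_pids, pid_t *per_level)
{
	int nr_levels = level + 1;
	int first, i;

	if (level < 0 || nr_pids < 0 || nr_pids > nr_levels)
		return -1;

	/* Oldest level that receives an explicitly requested pid. */
	first = nr_levels - nr_pids;

	/* Older levels are left to the kernel. */
	for (i = 0; i < first; i++)
		per_level[i] = 0;

	/* The list fills the youngest levels, preserving its order. */
	for (i = 0; i < nr_pids; i++)
		per_level[first + i] = pids[i];

	return 0;
}
```

With level 6 and three pids, as in the documentation's example, slots 4, 5 and 6 receive pids[0], pids[1] and pids[2], and slots 0 through 3 stay 0.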
* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
2010-05-01 14:14 ` [PATCH v21 011/100] eclone (11/11): Document sys_eclone Oren Laadan
@ 2010-05-05 21:14 ` Randy Dunlap
2010-05-05 22:25 ` Sukadev Bhattiprolu
0 siblings, 1 reply; 14+ messages in thread
From: Randy Dunlap @ 2010-05-05 21:14 UTC (permalink / raw)
To: Oren Laadan
Cc: Andrew Morton, containers, linux-kernel, Serge Hallyn,
Matt Helsley, Pavel Emelyanov, Sukadev Bhattiprolu, linux-api,
x86, linux-s390, linuxppc-dev
On Sat, 1 May 2010 10:14:53 -0400 Oren Laadan wrote:
> From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
>
> This gives a brief overview of the eclone() system call. We should
> eventually describe more details in existing clone(2) man page or in
> a new man page.
>
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> Acked-by: Serge E. Hallyn <serue@us.ibm.com>
> Acked-by: Oren Laadan <orenl@cs.columbia.edu>
> ---
> Documentation/eclone | 348 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 348 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/eclone
>
> diff --git a/Documentation/eclone b/Documentation/eclone
> new file mode 100644
> index 0000000..c2f1b4b
> --- /dev/null
> +++ b/Documentation/eclone
> @@ -0,0 +1,348 @@
> +
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack;
> +	u64 child_stack_size;
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +	u32 nr_pids;
> +	u32 reserved0;
> +};
> +
> +
> +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
> + pid_t * __user pids)
> +
> + In addition to doing everything that clone() system call does, the
that the clone()
> + eclone() system call:
> +
> + - allows additional clone flags (31 of 32 bits in the flags
> + parameter to clone() are in use)
> +
> + - allows user to specify a pid for the child process in its
> + active and ancestor pid namespaces.
> +
> + This system call is meant to be used when restarting an application
> + from a checkpoint. Such restart requires that the processes in the
> + application have the same pids they had when the application was
> + checkpointed. When containers are nested, the processes within the
> + containers exist in multiple pid namespaces and hence have multiple
> + pids to specify during restart.
> +
> + The @flags_low parameter is identical to the 'clone_flags' parameter
> + in existing clone() system call.
in the existing
> +
> + The fields in 'struct clone_args' are meant to be used as follows:
> +
> + u64 clone_flags_high:
> +
> + When eclone() supports more than 32 flags, the additional bits
> + in the clone_flags should be specified in this field. This
> + field is currently unused and must be set to 0.
> +
> + u64 child_stack;
> + u64 child_stack_size;
> +
> + These two fields correspond to the 'child_stack' fields in
> + clone() and clone2() (on IA64) system calls. The usage of
> + these two fields depends on the processor architecture.
> +
> + Most architectures use ->child_stack to pass-in a stack-pointer
to pass in
> + itself and don't need the ->child_stack_size field. On these
> + architectures the ->child_stack_size field must be 0.
> +
> + Some architectures, eg IA64, use ->child_stack to pass-in the
e.g. to pass in
> + base of the region allocated for stack. These architectures
> + must pass in the size of the stack-region in ->child_stack_size.
stack region
Seems unfortunate that different architectures use the fields differently.
> +
> + u64 parent_tid_ptr;
> + u64 child_tid_ptr;
> +
> + These two fields correspond to the 'parent_tid_ptr' and
> + 'child_tid_ptr' fields in the clone() system call
system call.
> +
> + u32 nr_pids;
> +
> + nr_pids specifies the number of pids in the @pids array
> + parameter to eclone() (see below). nr_pids should not exceed
> + the current nesting level of the calling process (i.e if the
i.e.
> + process is in init_pid_ns, nr_pids must be 1, if process is
> + in a pid namespace that is a child of init-pid-ns, nr_pids
> + cannot exceed 2, and so on).
> +
> + u32 reserved0;
> + u64 reserved1;
> +
> + These fields are intended to extend the functionality of the
> + eclone() in the future, while preserving backward compatibility.
> + They must be set to 0 for now.
The struct does not have a reserved1 field AFAICT.
> + The @cargs_size parameter specifies the sizeof(struct clone_args) and
> + is intended to enable extending this structure in the future, while
> + preserving backward compatibility. For now, this field must be set
> + to the sizeof(struct clone_args) and this size must match the kernel's
> + view of the structure.
> +
> + The @pids parameter defines the set of pids that should be assigned to
> + the child process in its active and ancestor pid namespaces. The
> + descendant pid namespaces do not matter since a process does not have a
> + pid in descendant namespaces, unless the process is in a new pid
> + namespace in which case the process is a container-init (and must have
> + the pid 1 in that namespace).
> +
> + See CLONE_NEWPID section of clone(2) man page for details about pid
of the clone(2)
> + namespaces.
> +
> + If a pid in the @pids list is 0, the kernel will assign the next
> + available pid in the pid namespace.
> +
> + If a pid in the @pids list is non-zero, the kernel tries to assign
> + the specified pid in that namespace. If that pid is already in use
> + by another process, the system call fails (see EBUSY below).
> +
> + The order of pids in @pids is oldest in pids[0] to youngest pid
> + namespace in pids[nr_pids-1]. If the number of pids specified in the
> + @pids list is fewer than the nesting level of the process, the pids
> + are applied from youngest namespace. i.e if the process is nested in
the youngest namespace. I.e.
> + a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
> + are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
> + have a pid of '0' (the kernel will assign a pid in those namespaces).
> +
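To make the ordering rule above concrete, here is a minimal sketch, not part of the patch, of how a short @pids list would line up with namespace levels. It assumes levels are numbered from 0 for init_pid_ns, as in the level-6 example; expand_pids() and MAX_LEVELS are hypothetical names used only for illustration.

```c
#include <string.h>

#define MAX_LEVELS 32	/* arbitrary bound for this sketch */

/*
 * Hypothetical helper (not kernel code): expand a short @pids list into
 * one requested pid per namespace level, 0..level. Levels that @pids does
 * not cover get 0, meaning "let the kernel choose".
 * Returns 0 on success, -1 if nr_pids does not fit the nesting depth.
 */
static int expand_pids(const int *pids, int nr_pids, int level, int *out)
{
	int nlevels = level + 1;	/* levels 0..level */
	int i;

	if (nr_pids < 0 || nr_pids > nlevels || nlevels > MAX_LEVELS)
		return -1;

	memset(out, 0, nlevels * sizeof(*out));

	/* pids[nr_pids - 1] belongs to the youngest (deepest) namespace */
	for (i = 0; i < nr_pids; i++)
		out[nlevels - nr_pids + i] = pids[i];

	return 0;
}
```

With level = 6 and three pids, the entries land in out[4], out[5] and out[6], and out[0] through out[3] stay 0, which matches the description in the text.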
> + On success, the system call returns the pid of the child process in
> + the parent's active pid namespace.
> +
> + On failure, eclone() returns -1 and sets 'errno' to one of the following
> + values (the child process is not created).
> +
> + EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to
> + specify the pids in this call (if pids are not specified
> + CAP_SYS_ADMIN is not required).
> +
> + EINVAL The number of pids specified in 'clone_args.nr_pids' exceeds
> + the current nesting level of parent process
process.
> +
> + EINVAL Not all specified clone-flags are valid.
> +
> + EINVAL The reserved fields in the clone_args argument are not 0.
> +
> + EINVAL The child_stack_size field is not 0 (on architectures that
> + pass in a stack pointer in ->child_stack field)
field).
> +
> + EBUSY A requested pid is in use by another process in that namespace.
> +
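The EINVAL cases listed above could be mirrored by a caller before issuing the system call. The following is only a sketch under stated assumptions: precheck_clone_args() is a hypothetical userspace helper, nesting_level counts init_pid_ns as 1 (as in the nr_pids description), and a non-zero clone_flags_high is treated as an invalid flag. EPERM and EBUSY depend on kernel state and cannot be predicted this way.

```c
#include <errno.h>
#include <stdint.h>

/* Same layout as struct clone_args in the patch, using <stdint.h> types. */
struct clone_args {
	uint64_t clone_flags_high;
	uint64_t child_stack;
	uint64_t child_stack_size;
	uint64_t parent_tid_ptr;
	uint64_t child_tid_ptr;
	uint32_t nr_pids;
	uint32_t reserved0;
};

/*
 * Hypothetical userspace pre-check, not kernel code.
 * @nesting_level:    1 for init_pid_ns, 2 for a child of it, and so on.
 * @stack_is_pointer: non-zero on architectures that pass a stack pointer
 *                    in ->child_stack and ignore ->child_stack_size.
 * Returns 0 if the arguments look acceptable, -EINVAL otherwise.
 */
static int precheck_clone_args(const struct clone_args *ca,
		unsigned int nesting_level, int stack_is_pointer)
{
	if (ca->nr_pids > nesting_level)
		return -EINVAL;
	if (ca->clone_flags_high != 0)	/* no high flags are defined yet */
		return -EINVAL;
	if (ca->reserved0 != 0)
		return -EINVAL;
	if (stack_is_pointer && ca->child_stack_size != 0)
		return -EINVAL;
	return 0;
}
```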
> +---
Is this example program meant to build only on i386?
On x86_64 I get:
eclone-syscall-test.c: In function 'do_clone':
eclone-syscall-test.c:166: warning: assignment makes pointer from integer without a cast
/tmp/cc0OrhU3.o: In function `do_clone':
eclone-syscall-test.c:(.text+0x173): undefined reference to `setup_stack'
eclone-syscall-test.c:(.text+0x1e2): undefined reference to `eclone'
> +/*
> + * Example eclone() usage - Create a child process with pid CHILD_TID1 in
> + * the current pid namespace. The child gets the usual "random" pid in any
> + * ancestor pid namespaces.
> + */
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <signal.h>
> +#include <errno.h>
> +#include <unistd.h>
> +#include <wait.h>
> +#include <sys/syscall.h>
> +
> +#define __NR_eclone 337
> +#define CLONE_NEWPID 0x20000000
> +#define CLONE_CHILD_SETTID 0x01000000
> +#define CLONE_PARENT_SETTID 0x00100000
> +#define CLONE_UNUSED 0x00001000
> +
> +#define STACKSIZE 8192
> +
> +typedef unsigned long long u64;
> +typedef unsigned int u32;
> +typedef int pid_t;
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack;
> +	u64 child_stack_size;
> +
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +
> +	u32 nr_pids;
> +
> +	u32 reserved0;
> +};
> +
> +#define exit _exit
> +
> +/*
> + * Following eclone() is based on code posted by Oren Laadan at:
> + * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
> + */
> +#if defined(__i386__) && defined(__NR_eclone)
> +
> +int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
> +		int *pids)
> +{
> +	long retval;
> +
> +	__asm__ __volatile__(
> +		"movl %3, %%ebx\n\t" /* flags_low -> 1st (ebx) */
> +		"movl %4, %%ecx\n\t" /* clone_args -> 2nd (ecx)*/
> +		"movl %5, %%edx\n\t" /* args_size -> 3rd (edx) */
> +		"movl %6, %%edi\n\t" /* pids -> 4th (edi)*/
> +
> +		"pushl %%ebp\n\t" /* save value of ebp */
> +		"int $0x80\n\t" /* Linux/i386 system call */
> +		"testl %0,%0\n\t" /* check return value */
> +		"jne 1f\n\t" /* jump if parent */
> +
> +		"popl %%esi\n\t" /* get subthread function */
> +		"call *%%esi\n\t" /* start subthread function */
> +		"movl %2,%0\n\t"
> +		"int $0x80\n" /* exit system call: exit subthread */
> +		"1:\n\t"
> +		"popl %%ebp\t" /* restore parent's ebp */
> +
> +		:"=a" (retval)
> +
> +		:"0" (__NR_eclone),
> +		 "i" (__NR_exit),
> +		 "m" (flags_low),
> +		 "m" (clone_args),
> +		 "m" (args_size),
> +		 "m" (pids)
> +	);
> +
> +	if (retval < 0) {
> +		errno = -retval;
> +		retval = -1;
> +	}
> +	return retval;
> +}
> +
> +/*
> + * Allocate a stack for the clone-child and arrange to have the child
> + * execute @child_fn with @child_arg as the argument.
> + */
> +void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
> +{
> +	void *stack_base;
> +	void **stack_top;
> +
> +	stack_base = malloc(size + size);
> +	if (!stack_base) {
> +		perror("malloc()");
> +		exit(1);
> +	}
> +
> +	stack_top = (void **)((char *)stack_base + (size - 4));
> +	*--stack_top = child_arg;
> +	*--stack_top = child_fn;
> +
> +	return stack_top;
> +}
> +#endif
> +
> +/* gettid() is a bit more useful than getpid() when messing with clone() */
> +int gettid()
> +{
> +	int rc;
> +
> +	rc = syscall(__NR_gettid, 0, 0, 0);
> +	if (rc < 0) {
> +		printf("rc %d, errno %d\n", rc, errno);
> +		exit(1);
> +	}
> +	return rc;
> +}
> +
> +#define CHILD_TID1 377
> +#define CHILD_TID2 1177
> +#define CHILD_TID3 2799
> +
> +struct clone_args clone_args;
> +void *child_arg = &clone_args;
> +int child_tid;
> +
> +int do_child(void *arg)
> +{
> +	struct clone_args *cs = (struct clone_args *)arg;
> +	int ctid;
> +
> +	/* Verify we pushed the arguments correctly on the stack... */
> +	if (arg != child_arg) {
> +		printf("Child: Incorrect child arg pointer, expected %p, "
> +				"actual %p\n", child_arg, arg);
> +		exit(1);
> +	}
> +
> +	/* ... and that we got the thread-id we expected */
> +	ctid = *((int *)(unsigned long)cs->child_tid_ptr);
> +	if (ctid != CHILD_TID1) {
> +		printf("Child: Incorrect child tid, expected %d, actual %d\n",
> +				CHILD_TID1, ctid);
> +		exit(1);
> +	} else {
> +		printf("Child got the expected tid, %d\n", gettid());
> +	}
> +	sleep(2);
> +
> +	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
> +	exit(0);
> +}
> +
> +static int do_clone(int (*child_fn)(void *), void *child_arg,
> +		unsigned int flags_low, int nr_pids, pid_t *pids_list)
> +{
> +	int rc;
> +	void *stack;
> +	struct clone_args *ca = &clone_args;
> +	int args_size;
> +
> +	stack = setup_stack(child_fn, child_arg, STACKSIZE);
> +
> +	memset(ca, 0, sizeof(*ca));
> +
> +	ca->child_stack = (u64)(unsigned long)stack;
> +	ca->child_stack_size = (u64)0;
> +	ca->child_tid_ptr = (u64)(unsigned long)&child_tid;
> +	ca->nr_pids = nr_pids;
> +
> +	args_size = sizeof(struct clone_args);
> +	rc = eclone(flags_low, ca, args_size, pids_list);
> +
> +	printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
> +			rc, errno);
> +	return rc;
> +}
> +
> +/*
> + * Multiple pid_t values in pids_list[] here are just for illustration.
> + * The test case creates a child in the current pid namespace and uses only
> + * the first value, CHILD_TID1.
> + */
> +pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
> +int main()
> +{
> +	int rc, pid, status;
> +	unsigned long flags;
> +	int nr_pids = 1;
> +
> +	flags = SIGCHLD|CLONE_CHILD_SETTID;
> +
> +	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
> +
> +	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
> +
> +	rc = waitpid(pid, &status, __WALL);
> +	if (rc < 0) {
> +		printf("waitpid(): rc %d, error %d\n", rc, errno);
> +	} else {
> +		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
> +				gettid(), rc, status);
> +
> +		if (WIFEXITED(status)) {
> +			printf("\t EXITED, %d\n", WEXITSTATUS(status));
> +		} else if (WIFSIGNALED(status)) {
> +			printf("\t SIGNALED, %d\n", WTERMSIG(status));
> +		}
> +	}
> +	return 0;
> +}
> --
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
2010-05-05 21:14 ` Randy Dunlap
@ 2010-05-05 22:25 ` Sukadev Bhattiprolu
0 siblings, 0 replies; 14+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-05 22:25 UTC (permalink / raw)
To: Randy Dunlap
Cc: Oren Laadan, Andrew Morton, containers, linux-kernel,
Serge Hallyn, Matt Helsley, Pavel Emelyanov, linux-api, x86,
linux-s390, linuxppc-dev
Randy Dunlap [randy.dunlap@oracle.com] wrote:
| > + base of the region allocated for stack. These architectures
| > + must pass in the size of the stack-region in ->child_stack_size.
|
| stack region
|
| Seems unfortunate that different architectures use the fields differently.
Yes and no. The field still has a single purpose, just that some architectures
may not need it. We enforce that if unused on an architecture, the field must
be 0. It looked like the easiest way to keep the API common across
architectures.
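For illustration only, here is a rough sketch (not code from the patch set) of what the rule above means for a caller that has allocated a stack with malloc(). The names describe_stack(), struct stack_fields and enum stack_convention are made up for this example, and the pointer case assumes a stack that grows downward, which does not hold on every architecture.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the two stack-related clone_args fields. */
struct stack_fields {
	uint64_t child_stack;
	uint64_t child_stack_size;
};

/* Which convention the target architecture uses (assumed, per the text). */
enum stack_convention {
	STACK_POINTER,	/* most architectures: ->child_stack is a stack pointer */
	STACK_REGION	/* e.g. IA64: ->child_stack is the region base */
};

/*
 * Fill in the two fields for a stack region of @size bytes at @base.
 * Sketch only: assumes a downward-growing stack in the STACK_POINTER case.
 */
static struct stack_fields describe_stack(void *base, size_t size,
		enum stack_convention conv)
{
	struct stack_fields f;

	if (conv == STACK_REGION) {
		f.child_stack = (uint64_t)(uintptr_t)base;
		f.child_stack_size = size;
	} else {
		/* pass the top of the region; the size field must stay 0 */
		f.child_stack = (uint64_t)(uintptr_t)((char *)base + size);
		f.child_stack_size = 0;
	}
	return f;
}
```

A portable caller would have to pick the convention per architecture at build time, which is the kind of #ifdef this thread is debating.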
|
| Is this example program meant to build only on i386?
Yes. Will add a pointer to the clone*.[chS] and libeclone.a files in
git://git.ncl.cs.columbia.edu/pub/git/user-cr.git
for other architectures (currently x86_64, ppc, s390).
Thanks for the review. Will fix the errors and repost.
Sukadev
Thread overview: 14+ messages
2010-05-29 10:31 [PATCH v21 011/100] eclone (11/11): Document sys_eclone Albert Cahalan
2010-06-01 19:32 ` Sukadev Bhattiprolu
2010-06-01 19:59 ` Albert Cahalan
2010-06-02 1:38 ` Sukadev Bhattiprolu
2010-06-05 11:49 ` Albert Cahalan
2010-06-05 11:58 ` Albert Cahalan
2010-06-05 12:08 ` Albert Cahalan
2010-06-09 18:14 ` Sukadev Bhattiprolu
2010-06-09 18:46 ` H. Peter Anvin
2010-06-09 22:32 ` Roland McGrath
2010-06-10 9:15 ` Arnd Bergmann
-- strict thread matches above, loose matches on Subject: below --
2010-05-01 14:14 [PATCH v21 00/100] Kernel based checkpoint/restart Oren Laadan
2010-05-01 14:14 ` [PATCH v21 011/100] eclone (11/11): Document sys_eclone Oren Laadan
2010-05-05 21:14 ` Randy Dunlap
2010-05-05 22:25 ` Sukadev Bhattiprolu