public inbox for linux-kernel@vger.kernel.org
From: Daniel Lezcano <daniel.lezcano@free.fr>
To: Oren Laadan <orenl@librato.com>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>,
	randy.dunlap@oracle.com, arnd@arndb.de,
	linux-api@vger.kernel.org,
	Containers <containers@lists.linux-foundation.org>,
	Nathan Lynch <nathanl@austin.ibm.com>,
	linux-kernel@vger.kernel.org, Louis.Rilling@kerlabs.com,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	kosaki.motohiro@jp.fujitsu.com, hpa@zytor.com, mingo@elte.hu,
	torvalds@linux-foundation.org,
	Alexey Dobriyan <adobriyan@gmail.com>,
	roland@redhat.com, Pavel Emelyanov <xemul@openvz.org>
Subject: Re: [RFC][v8][PATCH 0/10] Implement clone3() system call
Date: Thu, 22 Oct 2009 13:22:49 +0200
Message-ID: <4AE04089.9020907@free.fr>
In-Reply-To: <4ADF56D4.8030405@librato.com>

Oren Laadan wrote:
>
> Daniel Lezcano wrote:
>> Oren Laadan wrote:
>>> Daniel Lezcano wrote:
>> [ ... ]
>>
>>>> I forgot to mention a constraint with the specified pid : P2 has to
>>>> be child of P1.
>>>> In other words, you cannot specify a pid to cloneat which is not your
>>>> descendant (including yourself).
>>>> With this constraint I think there is no security issues.
>>> Sounds dangerous. What if your descendant executed a setuid program ?
>> That does not happen because you inherit the context of the caller.
>>
>>>> Concerning of forking on behalf of another process, we can consider
>>>> it is up to the caller / programmer to know what it does. If a
>>>> process in 
>>> Before the user can program with this syscall, _you_ need to define
>>> the semantics of this syscall. 
>> Yes, you are right. Here it is the proposition of the semantics.
>>
>> Function prototype is:
>>
>> pid_t cloneat(pid_t pid, pid_t hint, struct clone_args *args);
>>
>> Structure types are:
>>
>> typedef int clone_flag_t;
>>
>> struct clone_args {
>>     clone_flag_t *flags;
>>     int flags_size;
>>     u32 reserved1;
>>     u32 reserved2;
>>     u64 child_stack_base;
>>     u64 child_stack_size;
>>     u64 parent_tid_ptr;
>>     u64 child_tid_ptr;
>>     u64 reserved3;
>> };
>>
>> With the helper macros:
>>
>> void CLONE_SET(int flag, clone_flag_t *flags);
>> void CLONE_CLR(int flag, clone_flag_t *flags);
>> bool CLONE_ISSET(int flag, clone_flag_t *flags);
>> void CLONE_ZERO(clone_flag_t *clone_flags);
>>
>> And:
>>
>> #define CLONEXT_VM      0x20  /* CLONE_VM>>3 */
>> #define CLONEXT_FS      0x21
>> #define CLONEXT_FILES   0x22
>> ...
>>
>
> The main motivation for your new syscall is to make it possible to
> inject a process into a namespace. IOW, what you are proposing is
> a new incarnation of sys_hijack().
>
> This is _orthogonal_ to the current discussion, which is about an
> extension for clone to allow (a) choosing target pid(s), (b) more
> flags, and (c) future extensions.
>
> (Your suggested syscall may, too, allow requesting a specific set
> of pids for the child process, and reuse the current code for that).
>
> I suggest that you start a new thread about your RFC. This will
> reduce distractions on the current thread, and bring more focus to
> your proposal. I surely will post some comments there :)

I could argue exactly the same thing: the main motivation for your new
syscall is to make it possible to recreate a process tree for
checkpoint / restart, and that is orthogonal to adding extended clone
flags :)

But my main motivation is to be able to a) choose a target pid
__and__ b) clone the process relative to another one. These two features
allow us to do what *we* need, that is, recreate a process tree. The
bonus with this approach is the ability to inject a process into a
namespace, something several people have asked for, e.g. to debug with
gdb an application running in another pid namespace (not supported today).
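
To make the flag helpers from the proposal quoted above more concrete,
here is a minimal userspace sketch. It assumes the CLONEXT_* values are
bit indices into an array of clone_flag_t, in the style of fd_set; the
proposal does not spell that out. CLONE_FLAG_WORDS, clone_word() and
clone_mask() are hypothetical names added for illustration only, and
nothing here is taken from an actual patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef int clone_flag_t;

/* Values from the proposal; assumed here to be bit indices. */
#define CLONEXT_VM     0x20
#define CLONEXT_FS     0x21
#define CLONEXT_FILES  0x22

/* Hypothetical sizing, not part of the proposal. */
#define CLONE_FLAG_BITS   (sizeof(clone_flag_t) * CHAR_BIT)
#define CLONE_FLAG_WORDS  4

/* Index of the array element holding the bit for 'flag'. */
static inline size_t clone_word(int flag)
{
    assert(flag >= 0 &&
           (size_t)flag < CLONE_FLAG_WORDS * CLONE_FLAG_BITS);
    return (size_t)flag / CLONE_FLAG_BITS;
}

/* Mask selecting the bit for 'flag' inside its array element. */
static inline unsigned clone_mask(int flag)
{
    return 1u << ((size_t)flag % CLONE_FLAG_BITS);
}

/* Clear every flag in the set. */
static inline void CLONE_ZERO(clone_flag_t *flags)
{
    memset(flags, 0, CLONE_FLAG_WORDS * sizeof(*flags));
}

/* Bit operations go through unsigned to avoid shifting into the sign bit. */
static inline void CLONE_SET(int flag, clone_flag_t *flags)
{
    size_t w = clone_word(flag);
    flags[w] = (clone_flag_t)((unsigned)flags[w] | clone_mask(flag));
}

static inline void CLONE_CLR(int flag, clone_flag_t *flags)
{
    size_t w = clone_word(flag);
    flags[w] = (clone_flag_t)((unsigned)flags[w] & ~clone_mask(flag));
}

static inline bool CLONE_ISSET(int flag, clone_flag_t *flags)
{
    return ((unsigned)flags[clone_word(flag)] & clone_mask(flag)) != 0;
}
```

A caller would declare clone_flag_t flags[CLONE_FLAG_WORDS], call
CLONE_ZERO(flags), set the bits it wants with CLONE_SET(), and pass the
array and its size through struct clone_args.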

I am sorry for coming late to the discussion and for the distraction.

> [...]
>
>> The cloneat syscall can be used for the following use cases:
>>
>> * checkpoint / restart:
>>
>> The restart can be done with a clone(.., CLONE_NEWPID|...);
>> Then the new pid (aka pid 1) retrieves the proctree from the statefile
>> and creates the different tasks with the process hierarchy with the
>> cloneat syscall.
>
> s/cloneat/$CLONE3/
> (hint: this is how it's done now)
Of course, what is described is what you do with 'clone3'!
Do you think I would come and propose a variant of 'clone3' that does
not do what you need? :)
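
Whichever syscall ends up creating the tasks, the restart scheme quoted
above only works if each task is created after its parent. Below is a
minimal sketch of that ordering step alone. The statefile layout is not
defined anywhere in this thread, so struct task_rec and restart_order()
are hypothetical, and the creation call itself (cloneat or clone3) is
deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical statefile record: one entry per checkpointed task. */
struct task_rec {
    pid_t pid;
    pid_t ppid;   /* 0 marks the root of the tree (the container init) */
};

/*
 * Fill order[] with indices into tasks[] so that every parent comes
 * before its children (breadth-first from the root entries).
 * Assumes pids are unique. Returns the number of entries placed:
 * n on success, fewer if some records are unreachable from a root.
 */
static size_t restart_order(const struct task_rec *tasks, size_t n,
                            size_t *order)
{
    size_t placed = 0, head = 0;

    /* Seed the queue with the root entries. */
    for (size_t i = 0; i < n; i++)
        if (tasks[i].ppid == 0 && placed < n)
            order[placed++] = i;

    /* order[] doubles as the BFS queue; 'placed < n' bounds the writes. */
    while (head < placed) {
        pid_t parent = tasks[order[head++]].pid;

        for (size_t i = 0; i < n; i++)
            if (tasks[i].ppid == parent && placed < n)
                order[placed++] = i;
    }
    return placed;
}
```

The restarting pid 1 would then walk order[] and create each task with
the pid recorded for it, using the parent that is already running.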

>> The proctree creation can be done from outside of the pid namespace or
>> from inside.
>
> Ew .. why would you do that ?
And why not? Are there semantics specifying how a process tree should be
recreated?

>> Concerning nested pid namespaces, IMHO I would not try to checkpoint /
>> restart them. The checkpoint of a nested pid namespace should be
>> forbidden except for the leaf of a pid namespaces tree. That should
>
> Others (me included) *will* try and may get upset if forbidden...
> Seriously, there is no technical reason to restrict this.

Ok.

>>> Can you define more precisely what you mean by "enter" the container ?
>>> If you simply want to create a new process in the container, you can
>>> achieve the same thing with a daemon, or a smart init process (in
>>> there), or even ptrace tricks.
>> Yes, you can launch a daemon inside the container, that works for a
>> system container because the container is killed by killing the first
>> process of the container or by a shutdown inside the container (not
>> fully implemented in the kernel).
>> But this is unreliable for application containers, I won't enter in the
>> details but the container exits when the application exits, with a
>> daemon inside the container, this is no longer the case because you can
>> not detect the application death as the daemon is always there.
>>
>> With cloneat you restrict the life cycle of the command you launched,
>> that is the container exits as soon as all the processes exited the
>> container, including the spawned command itself.
>
> Then start a daemon _in addition_ to the application, or write a
> daemon that will launch the application and monitor it... And also
> there is ptrace -
Already tried :)

http://lxc.git.sourceforge.net/git/gitweb.cgi?p=lxc/lxc;a=blob;f=src/lxc/lxc_cinit.c;h=8f235483c1a9d9c9e0cc1ba69f1c33f1bc98b8aa;hb=57ff723f6a174a2a01c58c6ac367d118ef12b91c

> But, please let's take this off to a new thread about how to
> add a process into a namespace from the outside. FYI, I do think
> such an interface may be useful and nicer than the two alternatives
> I suggested above.
>
>>> Also, there is a reason why sys_hijack() was hijacked away ... And
>>> I honestly think that a syscall to force another process to clone
>>> would be shot down by the kernel guys.
>> Maybe, maybe not. CLONE_PARENT exists and looks similar to cloneat.
>
> Actually, I misread previously; I mean not forcing another process
> to clone, but instead forcing another process to become a parent (and
> I shall ignore the ethical issues :)
>
> I still suspect it won't be welcome. Several people would have liked
> to see CLONE_PARENT go away, too, if that was possible without breaking
> userspace applications. Yet another reason to take it to a discussion
> of its own.

At this point, I am hesitant to create a new thread for this
discussion, because there will then be:
 * clone
 * clone2
 * clone3

and we will once again be discussing a new clone syscall with a different API :(

I will not continue arguing on this thread unless someone is in favor
of cloneat.
Otherwise, I will start a new thread later.

Thanks
  -- Daniel


