* Creating tasks on restart: userspace vs kernel
@ 2009-04-14  3:43 Oren Laadan
  2009-04-14  9:59 ` Ingo Molnar
  2009-04-14 16:36 ` Alexey Dobriyan
  0 siblings, 2 replies; 21+ messages in thread
From: Oren Laadan @ 2009-04-14  3:43 UTC (permalink / raw)
  To: containers, Alexey Dobriyan
  Cc: Dave Hansen, Serge E. Hallyn, Andrew Morton, Linus Torvalds,
      Linux-Kernel, Ingo Molnar

For checkpoint/restart (c/r) we need a method to (re)create the tasks
tree during restart. There are basically two approaches: in userspace
(zap approach) or in the kernel (openvz approach).

Once tasks have been created both approaches are similar in that all
restarting tasks end up calling the equivalent of "do_restart()" in
the kernel to perform the gory details of restoring their state.

In terms of performance, both approaches are similar, and both can
optimize to avoid duplicating resources unnecessarily during the
clone (e.g. mm, etc) knowing that they will be reconstructed soon
after.

So the question is what's better - user-space or kernel ?

Too bad that Alexey chose to ignore what's been discussed in
linux-containers mailing list in his recent post. Here is my take on
cons/pros.

Task creation in the kernel
---------------------------
* how: the user program calls sys_restart() which, for each task to
  restore, creates a kernel thread which is demoted to a regular
  process manually.

* pro: a single task that calls sys_restart()
* pro: restarting tasks are in full control of kernel at all times

* con: arch-dependent, harder to port across architectures
* con: can only restart a full container

Task creation in user space
---------------------------
* how: the user program calls fork/clone to recreate a suitable
  task tree in userspace, and each task calls sys_restart() to restore
  its state; some kernel glue is necessary to synchronize restarting
  tasks when in the kernel.

* pro: allows important flexibility during restart (see <1>)
* pro: code leverages existing well-understood syscalls (fork, clone)
* pro: allows restart of only a subtree (see <2>)

* con: requires a way to create tasks with a specific pid (see <3>)

<1> Flexibility:

In the spirit of madvise() that lets tasks advise the kernel because
they know better, there should be cradvise() for checkpoint/restart
purposes. During checkpoint it can tell the kernel "don't save this
piece of memory, it's scratch", or "ignore this file-descriptor" etc.
During restart, it can tell the kernel "use this file-descriptor"
or "use this network namespace" (instead of trying to restore).

Offering cradvise() capability during restart is especially important
in cases where the kernel (inevitably) won't know how to restore a
resource (e.g. think special devices), when the application wants to
override (e.g. think of a c/r aware server that would like to change
the port on which it is listening), or when it's that much simpler to
do it in userspace (e.g. think setting up network namespaces).

Another important example is distributed checkpoint, where the
restarting tasks could (re)create all their network connections in
user space, before invoking sys_restart() and tell the kernel, via
cradvise(), to use the newly created sockets.

The need for this sort of flexibility has been stressed multiple times
and by multiple stake-holders interested in checkpoint/restart.

<2> Restarting a subtree:

The primary c/r effort is directed towards providing c/r functionality
for containers.

Wouldn't it be nice if, while doing so and at minimal added effort, we
also gain a method to checkpoint and restart an arbitrary subtree of
tasks, which isn't necessarily an entire container ?

Sure, it will be more constrained (e.g. resulting pid in restart won't
match the original pids), and won't work for all applications.

But it will still be a useful tool for many use cases, like batch cpu
jobs, some servers, vnc sessions (if you want graphics) etc. Imagine you
run 'octave' for a week and must reboot now - 'octave' wouldn't care if
you checkpointed it and then restarted it with a different pid !

<3> Clone with pid:

To restart processes from userspace, there needs to be a way to
request a specific pid--in the current pid_ns--for the child process
(clearly, if it isn't in use).

Why is it a disadvantage ? To Linus, a syscall clone_with_pid()
"sounds like a _wonderful_ attack vector against badly written
user-land software...". Actually, getting a specific pid is possible
without this syscall. But the point is that it's undesirable to have
this functionality unrestricted.

So one option is to require root privileges. Another option is to
restrict such action to a pid_ns created by the same user. Even more
so, restrict to only containers that are being restarted.

---

Either way we go, it should be fairly easy to switch from one method
to the other, should we need to.

All in all, there isn't a strong reason in favor of kernel method.

In contrast, it's at least as simple in userspace (reusing existing
syscalls). More importantly, we gain flexibility by restarting tasks
in userspace, with no cost incurred (in terms of implementation or
runtime overhead).

Oren.

^ permalink raw reply [flat|nested] 21+ messages in thread
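To make the userspace approach concrete, here is a minimal sketch of recreating a task tree with plain fork(). It is an illustration only, not code from either patchset: `struct task_desc`, `restore_self()` and `recreate_subtree()` are hypothetical names, and `restore_self()` merely stands in for the point where each task would call sys_restart(). Exit statuses are used to count restored tasks purely so the result is observable; a real restart would not have the tasks exit. The sketch assumes the table describes a tree of at most 64 tasks.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_TASKS   64    /* arbitrary bound for this sketch */
#define EXIT_FAILED 255   /* exit status reserved for "subtree failed" */

/* Hypothetical checkpoint-image record: one entry per task to recreate. */
struct task_desc {
	int id;      /* hypothetical task identifier from the image */
	int parent;  /* index of the parent entry, or -1 for the root */
};

/* Stand-in for the per-task sys_restart() call; returns 0 on success. */
static int restore_self(const struct task_desc *t)
{
	(void)t;
	return 0;
}

/*
 * Fork every child of tasks[self] first, let each child build its own
 * subtree, then collect the results. Returns the number of tasks restored
 * in this subtree (including self), or -1 on failure.
 */
static int recreate_subtree(const struct task_desc *tasks, int n, int self)
{
	pid_t kids[MAX_TASKS];
	int nkids = 0, count = 0, failed = 0, i;

	assert(n > 0 && n <= MAX_TASKS && self >= 0 && self < n);

	for (i = 0; i < n; i++) {
		pid_t pid;

		if (i == self || tasks[i].parent != self)
			continue;
		pid = fork();
		if (pid < 0) {
			failed = 1;
			break;
		}
		if (pid == 0) {
			/* Child: become task i and recreate its children. */
			int sub = recreate_subtree(tasks, n, i);
			_exit(sub < 0 ? EXIT_FAILED : sub);
		}
		kids[nkids++] = pid;
	}

	for (i = 0; i < nkids; i++) {
		int status;

		if (waitpid(kids[i], &status, 0) < 0 || !WIFEXITED(status) ||
		    WEXITSTATUS(status) == EXIT_FAILED)
			failed = 1;
		else
			count += WEXITSTATUS(status);
	}

	if (failed || restore_self(&tasks[self]) != 0)
		return -1;
	return count + 1;
}
```

Calling `recreate_subtree(tasks, n, 0)` from the coordinating process treats the caller as the root entry. Note that the children here get whatever pids the kernel hands out, which is exactly the limitation item <3> is about.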
* Re: Creating tasks on restart: userspace vs kernel 2009-04-14 3:43 Creating tasks on restart: userspace vs kernel Oren Laadan @ 2009-04-14 9:59 ` Ingo Molnar 2009-04-14 14:53 ` Oren Laadan 2009-04-14 16:36 ` Alexey Dobriyan 1 sibling, 1 reply; 21+ messages in thread From: Ingo Molnar @ 2009-04-14 9:59 UTC (permalink / raw) To: Oren Laadan Cc: containers, Alexey Dobriyan, Dave Hansen, Serge E. Hallyn, Andrew Morton, Linus Torvalds, Linux-Kernel * Oren Laadan <orenl@cs.columbia.edu> wrote: > <3> Clone with pid: > > To restart processes from userspace, there needs to be a way to > request a specific pid--in the current pid_ns--for the child > process (clearly, if it isn't in use). > > Why is it a disadvantage ? to Linus, a syscall clone_with_pid() > "sounds like a _wonderful_ attack vector against badly written > user-land software...". Actually, getting a specific pid is > possible without this syscall. But the point is that it's > undesirable to have this functionality unrestricted. The point is that there's a class of a difference between a racy and unreliable method of 'create tens of thousands of tasks to steal the right PID you are interested in' and a built-in syscall that gives this within a couple of microseconds. Most signal races are timing dependent so the ability to do it really quickly makes or breaks the practicality of many classes of exploits. > So one option is to require root privileges. Another option is to > restrict such action in pid_ns created by the same user. Even more > so, restrict to only containers that are being restarted. Requiring root privileges seems to remove much of the appeal of allowing this to be a more generic sub-container creation thing. If regular unprivileged apps cannot use this to save/restore their own local task hierarchy, the whole thing becomes rather pointless, right? Ingo ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Creating tasks on restart: userspace vs kernel 2009-04-14 9:59 ` Ingo Molnar @ 2009-04-14 14:53 ` Oren Laadan 2009-04-14 16:16 ` Serge E. Hallyn 0 siblings, 1 reply; 21+ messages in thread From: Oren Laadan @ 2009-04-14 14:53 UTC (permalink / raw) To: Ingo Molnar Cc: containers, Alexey Dobriyan, Dave Hansen, Serge E. Hallyn, Andrew Morton, Linus Torvalds, Linux-Kernel Ingo Molnar wrote: > * Oren Laadan <orenl@cs.columbia.edu> wrote: > >> <3> Clone with pid: >> >> To restart processes from userspace, there needs to be a way to >> request a specific pid--in the current pid_ns--for the child >> process (clearly, if it isn't in use). >> >> Why is it a disadvantage ? to Linus, a syscall clone_with_pid() >> "sounds like a _wonderful_ attack vector against badly written >> user-land software...". Actually, getting a specific pid is >> possible without this syscall. But the point is that it's >> undesirable to have this functionality unrestricted. > > The point is that there's a class of a difference between a racy and > unreliable method of 'create tens of thousands of tasks to steal the > right PID you are interested in' and a built-in syscall that gives > this within a couple of microseconds. > > Most signal races are timing dependent so the ability to do it > really quickly makes or breaks the practicality of many classes of > exploits. Exactly. > >> So one option is to require root privileges. Another option is to >> restrict such action in pid_ns created by the same user. Even more >> so, restrict to only containers that are being restarted. > > Requiring root privileges seems to remove much of the appeal of > allowing this to be a more generic sub-container creation thing. If > regular unprivileged apps cannot use this to save/restore their own > local task hierarchy, the whole thing becomes rather pointless, > right? First, I suggest to distinguish between two cases: (1) c/r of a whole container, and (2) c/r of a task subtree. 
(#2 is a nice byproduct of this work, but with more limited scope/applicability). #2 is easier: we don't use a new ipc_ns necessarily, so we don't need to (and perhaps can't) restore old pids. So there is no question about privileges. (This of course requires that the application be c/r-aware or c/r-agnostic). For #1, we need to create a new container to begin with. This already requires CAP_SYS_ADMIN. Yes, for now we can use some setuid() to create a new pid_ns and then do the restart. We will eventually need CAP_SYS_ADMIN for other parts of the restart, for instance to restore a listening socket on a privileged port, or to restore tasks of multiple users, or to restore an open file accessible by, say, root only (assume the original task opened the file and then dropped its privileges). So for c/r - eventually we'll need to trust something in the checkpoint image, like you trust a kernel module. One way to do it is to have the userland utility (particularly restart) setuid, and have it sign the image during checkpoint and then verify the signature during restart. Oren. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Creating tasks on restart: userspace vs kernel
  2009-04-14 14:53 ` Oren Laadan
@ 2009-04-14 16:16 ` Serge E. Hallyn
  0 siblings, 0 replies; 21+ messages in thread
From: Serge E. Hallyn @ 2009-04-14 16:16 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Ingo Molnar, containers, Alexey Dobriyan, Dave Hansen,
      Andrew Morton, Linus Torvalds, Linux-Kernel

Quoting Oren Laadan (orenl@cs.columbia.edu):
> For #1, we need to create a new container to begin with. This already
> requires CAP_SYS_ADMIN. Yes, for now we can use some setuid() to create
> a new pid_ns and then do the restart.

This is why I like tagging a pidns with a userid, and requiring
that current->euid==pidns->uid in order to be allowed to set
pid in that pidns. We require cap_sys_admin while doing
clone(CLONE_NEWPID). So if we do that while uid=500, then
drop cap_sys_admin, then we can proceed to create new tasks
with specified pids in that pidns.

-serge
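The ownership rule described in this message can be modelled in a few lines of plain C. This is only a userspace model of the policy, under the assumption that a pid namespace records the euid of its creator. `struct pidns_model`, `pidns_create()` and `may_set_pid()` are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical model of a pid namespace tagged with its creator's uid. */
struct pidns_model {
	uid_t owner_uid;
};

/*
 * Model of clone(CLONE_NEWPID): creating the namespace requires
 * CAP_SYS_ADMIN, and the namespace remembers who created it.
 */
static int pidns_create(struct pidns_model *ns, uid_t euid, bool cap_sys_admin)
{
	assert(ns != NULL);
	if (!cap_sys_admin)
		return -EPERM;
	ns->owner_uid = euid;
	return 0;
}

/*
 * Model of the proposed check: a task may request a specific pid in the
 * namespace only if its euid matches the namespace owner. No capability
 * is needed at this point.
 */
static bool may_set_pid(const struct pidns_model *ns, uid_t euid)
{
	return euid == ns->owner_uid;
}
```

With this model, a namespace created while uid=500 and privileged can later be populated by uid 500 without the capability, while any other uid is refused.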
* Re: Creating tasks on restart: userspace vs kernel 2009-04-14 3:43 Creating tasks on restart: userspace vs kernel Oren Laadan 2009-04-14 9:59 ` Ingo Molnar @ 2009-04-14 16:36 ` Alexey Dobriyan 2009-04-14 16:46 ` Alexey Dobriyan 2009-04-14 18:40 ` Oren Laadan 1 sibling, 2 replies; 21+ messages in thread From: Alexey Dobriyan @ 2009-04-14 16:36 UTC (permalink / raw) To: Oren Laadan Cc: containers, Dave Hansen, Serge E. Hallyn, Andrew Morton, Linus Torvalds, Linux-Kernel, Ingo Molnar On Mon, Apr 13, 2009 at 11:43:30PM -0400, Oren Laadan wrote: > For checkpoint/restart (c/r) we need a method to (re)create the tasks > tree during restart. There are basically two approaches: in userspace > (zap approach) or in the kernel (openvz approach). > > Once tasks have been created both approaches are similar in that all > restarting tasks end up calling the equivalent of "do_restart()" in > the kernel to perform the gory details of restoring its state. > > In terms of performance, both approaches are similar, and both can > optimize to avoid duplicating resources unnecessarily during the > clone (e.g. mm, etc) knowing that they will be reconstructed soon > after. > > So the question is what's better - user-space or kernel ? > > Too bad that Alexey chose to ignore what's been discussed in > linux-containers mailing list in his recent post. Here is my take on > cons/pros. > > Task creation in the kernel > --------------------------- > * how: the user program calls sys_restart() which, for each task to > restore, creates a kernel thread which is demoted to a regular > process manually. > > * pro: a single task that calls sys_restart() > * pro: restarting tasks are in full control of kernel at all times > > * con: arch-dependent, harder to port across architectures Not a "con" at all. For the usage purposes, kernel_thread() is arch-independent. Filesystems create kernel threads and not have a single bit of arch-specific code. 
> * con: can only restart a full container This is by design. Granularity of whole damn thing is one container both on checkpoint and restart. You want to chop pieces, fine, do surgery on _image_. > Task creation in user space > --------------------------- > * how: the user programs calls fork/clone to recreate a suitable > task tree in userspace, and each task calls sys_restart() to restore > its state; some kernel glue is necessary to synchronize restarting > tasks when in the kernel. > * pro: allows important flexibility during restart (see <1>) > * pro: code leverages existing well-understood syscalls (fork, clone) kernel_thread() is effectively clone(2). > * pro: allows restart of a only subtree (see <2>) > > * con: requires a way to creates tasks with specific pid (see <3>) > > <1> Flexibility: > > In the spirit of madvise() that lets tasks advise the kernel because > they know better, there should be cradvise() for checkpoint/restart > purposes. During checkpoint it can tell the kernel "don't save this > piece of memory, it's scratch", or "ignore this file-descriptor" etc. > During restart, it will can tell the kernel "use this file-descriptor" > or "use this network namespace" (instead of trying to restore). > > Offering cradvise() capability during restart is especially important > in cases where the kernel (inevitably) won't know how to restore a > resource (e.g. think special devices), when the application wants to > override (e.g. think of a c/r aware server that would like to change > the port on which it is listening), or when it's that much simpler to > do it in userspace (e.g. think setting up network namespaces). > > Another important example is distributed checkpoint, where the > restarting tasks could (re)create all their network connections in > user space, before invoking sys_restart() and tell the kernel, via > cradvise(), to use the newly created sockets. 
> > The need for this sort of flexibility has been stressed multiple times > and by multiple stake-holders interested in checkpoint/restart. > > <2> Restarting a subtree: > > The primary c/r effort is directed towards providing c/r functionality > for containers. > > Wouldn't it be nice if, while doing so and at minimal added effort, we > also gain a method to checkpoint and restart an arbitrary subtree of > tasks, which isn't necessarily an entire container ? Do this in userspace. > Sure, it will be more constrained (e.g. resulting pid in restart won't > match the original pids), and won't work for all applications. Given correctly written image chopper, all pids will be fine and correctness will be bounded by how good user understands what can be chopped and can't. Besides, if such chopper can only chop task_struct's, you'll get correct image. In the end correctness of chopping will be equal to how good user understands that two task_struct's are independent of each other. > But it will still be a useful tool for many use cases, like batch cpu jobs, > some servers, vnc sessions (if you want graphics) etc. Imagine you run > 'octave' for a week and must reboot now - 'octave' wouldn't care if > you checkpointed it and then restart with a different pid ! > > <3> Clone with pid: > > To restart processes from userspace, there needs to be a way to > request a specific pid--in the current pid_ns--for the child process > (clearly, if it isn't in use). > > Why is it a disadvantage ? to Linus, a syscall clone_with_pid() > "sounds like a _wonderful_ attack vector against badly written > user-land software...". Actually, getting a specific pid is possible > without this syscall. But the point is that it's undesirable to have > this functionality unrestricted. > > So one option is to require root privileges. Another option is to > restrict such action in pid_ns created by the same user. Even more so, > restrict to only containers that are being restarted. 

You want to do a small part in userspace and consequently end up with
hacks both userspace-visible and in-kernel.

Pids aren't special, they are struct pid, dynamically allocated and
refcounted just like any other structures.

They _become_ special for your intended method of restart.

You also have flags in nsproxy image (or where?) like "do clone with
CLONE_NEWUTS".

This is unneeded!

nsproxy (or task_struct) image has a reference (objref/position) to
uts_ns image.

On restart, one looks up the object by reference or restores it if
needed, takes a refcount and glues. Just like with every other two
structures.

No "what to do, what to do" logic.

> Either way we go, it should be fairly easy to switch from one method
> to the other, should we need to.
>
> All in all, there isn't a strong reason in favor of kernel method.
>
> In contrast, it's at least as simple in userspace (reusing existing
> syscalls). More importantly, the flexibility that we gain with restart
> of tasks in userspace, no cost incurred (in terms of implementation or
> runtime overhead).

Regarding who should orchestrate restart(2).

Special process who calls restart(2) should do it. It doesn't relate to
restarted process at all. It isn't, for example, init of future
container.

Reasons:
1) Somebody should write registers before final jump to userspace.
   Task itself can't generally do it: struct pt_regs is in the same
   place as kernel stack.

   cr_load_cpu_regs() does exactly this: as current, it writes to its
   own pt_regs. Oren, why don't you see crashes?

   I first tried to do it and was greeted with horrible crashes because
   e.g. current becoming NULL under current. That's why
   cr_arch_restore_task_struct() is not done in current context.

2) Somebody should restore who is parent of whom, who ptraces whom,
   resolve all possible loops and so on. Intuition tells me that it
   should be a context which is not involved in restart(2) other than
   doing this post-restart-before-thaw part.

   Consequently, it should not be init of future container. It's just
   another task after all, from the POV of reparenting code.
* Re: Creating tasks on restart: userspace vs kernel
  2009-04-14 16:36 ` Alexey Dobriyan
@ 2009-04-14 16:46 ` Alexey Dobriyan
  2009-04-14 18:40 ` Oren Laadan
  1 sibling, 0 replies; 21+ messages in thread
From: Alexey Dobriyan @ 2009-04-14 16:46 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, Dave Hansen, Serge E. Hallyn, Andrew Morton,
      Linus Torvalds, Linux-Kernel, Ingo Molnar

> 1) somebody should write registers before final jump to userspace.
> Task itself can't generally do it: struct pt_regs is in the same place
> as kernel stack.
>
> cr_load_cpu_regs() does exactly this: as current writes to it's own
> pt_regs. Oren, why don't you see crashes?
>
> I first tried to do it and was greeted with horrible crashes because
> e.g current becoming NULL under current. That's why
> cr_arch_restore_task_struct() is not done in current context.

Hmm, this must be an artefact of the kernel_thread() approach.
* Re: Creating tasks on restart: userspace vs kernel 2009-04-14 16:36 ` Alexey Dobriyan 2009-04-14 16:46 ` Alexey Dobriyan @ 2009-04-14 18:40 ` Oren Laadan 2009-04-14 19:59 ` Alexey Dobriyan 2009-04-15 19:56 ` C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel) Alexey Dobriyan 1 sibling, 2 replies; 21+ messages in thread From: Oren Laadan @ 2009-04-14 18:40 UTC (permalink / raw) To: Alexey Dobriyan Cc: containers, Dave Hansen, Serge E. Hallyn, Andrew Morton, Linus Torvalds, Linux-Kernel, Ingo Molnar Alexey Dobriyan wrote: > On Mon, Apr 13, 2009 at 11:43:30PM -0400, Oren Laadan wrote: >> For checkpoint/restart (c/r) we need a method to (re)create the tasks >> tree during restart. There are basically two approaches: in userspace >> (zap approach) or in the kernel (openvz approach). >> >> Once tasks have been created both approaches are similar in that all >> restarting tasks end up calling the equivalent of "do_restart()" in >> the kernel to perform the gory details of restoring its state. >> >> In terms of performance, both approaches are similar, and both can >> optimize to avoid duplicating resources unnecessarily during the >> clone (e.g. mm, etc) knowing that they will be reconstructed soon >> after. >> >> So the question is what's better - user-space or kernel ? >> >> Too bad that Alexey chose to ignore what's been discussed in >> linux-containers mailing list in his recent post. Here is my take on >> cons/pros. >> >> Task creation in the kernel >> --------------------------- >> * how: the user program calls sys_restart() which, for each task to >> restore, creates a kernel thread which is demoted to a regular >> process manually. >> >> * pro: a single task that calls sys_restart() >> * pro: restarting tasks are in full control of kernel at all times >> >> * con: arch-dependent, harder to port across architectures > > Not a "con" at all. > > For the usage purposes, kernel_thread() is arch-independent. 
> Filesystems create kernel threads and not have a single bit of > arch-specific code. My bad, I was relying on the older patchset submitted by Andrey. > >> * con: can only restart a full container > > This is by design. > > Granularity of whole damn thing is one container both on checkpoint and > restart. > > You want to chop pieces, fine, do surgery on _image_. I challenged that design decision already :) Again, so to checkpoint one task in the topmost pid-ns you need to checkpoint (if at all possible) the entire system ?! > >> Task creation in user space >> --------------------------- >> * how: the user programs calls fork/clone to recreate a suitable >> task tree in userspace, and each task calls sys_restart() to restore >> its state; some kernel glue is necessary to synchronize restarting >> tasks when in the kernel. > >> * pro: allows important flexibility during restart (see <1>) >> * pro: code leverages existing well-understood syscalls (fork, clone) > > kernel_thread() is effectively clone(2). By "leverage" I mean that no hand-crafting is needed later - where if you use kernel thread you need to convert it to a process, reparent it etc. The more we rely on existing code, the faster the path to enter mainline kernel, the less maintenance in the future, and the less likely it breaks due to other kernel changes. > >> * pro: allows restart of a only subtree (see <2>) >> >> * con: requires a way to creates tasks with specific pid (see <3>) >> >> <1> Flexibility: >> >> In the spirit of madvise() that lets tasks advise the kernel because >> they know better, there should be cradvise() for checkpoint/restart >> purposes. During checkpoint it can tell the kernel "don't save this >> piece of memory, it's scratch", or "ignore this file-descriptor" etc. >> During restart, it will can tell the kernel "use this file-descriptor" >> or "use this network namespace" (instead of trying to restore). 
>> >> Offering cradvise() capability during restart is especially important >> in cases where the kernel (inevitably) won't know how to restore a >> resource (e.g. think special devices), when the application wants to >> override (e.g. think of a c/r aware server that would like to change >> the port on which it is listening), or when it's that much simpler to >> do it in userspace (e.g. think setting up network namespaces). >> >> Another important example is distributed checkpoint, where the >> restarting tasks could (re)create all their network connections in >> user space, before invoking sys_restart() and tell the kernel, via >> cradvise(), to use the newly created sockets. >> >> The need for this sort of flexibility has been stressed multiple times >> and by multiple stake-holders interested in checkpoint/restart. >> >> <2> Restarting a subtree: >> >> The primary c/r effort is directed towards providing c/r functionality >> for containers. >> >> Wouldn't it be nice if, while doing so and at minimal added effort, we >> also gain a method to checkpoint and restart an arbitrary subtree of >> tasks, which isn't necessarily an entire container ? > > Do this in userspace. > >> Sure, it will be more constrained (e.g. resulting pid in restart won't >> match the original pids), and won't work for all applications. > > Given correctly written image chopper, all pids will be fine and > correctness will be bounded by how good user understands what can be > chopped and can't. > > Besides, if such chopper can only chop task_struct's, you'll get correct > image. See above. > > In the end correctness of chopping will be equal to how good user > understands that two task_struct's are independent of each other. > >> But it will still be a useful tool for many use cases, like batch cpu jobs, >> some servers, vnc sessions (if you want graphics) etc. 
Imagine you run >> 'octave' for a week and must reboot now - 'octave' wouldn't care if >> you checkpointed it and then restart with a different pid ! >> >> <3> Clone with pid: >> >> To restart processes from userspace, there needs to be a way to >> request a specific pid--in the current pid_ns--for the child process >> (clearly, if it isn't in use). >> >> Why is it a disadvantage ? to Linus, a syscall clone_with_pid() >> "sounds like a _wonderful_ attack vector against badly written >> user-land software...". Actually, getting a specific pid is possible >> without this syscall. But the point is that it's undesirable to have >> this functionality unrestricted. >> >> So one option is to require root privileges. Another option is to >> restrict such action in pid_ns created by the same user. Even more so, >> restrict to only containers that are being restarted. > > You want to do small part in userspace and consequently end up with hacks > both userspace-visible and in-kernel. I want to extend existing kernel interface to leverage fork/clone from user space, AND to allow the flexibility mentioned above (which you conveniently ignored). All hacks are in-kernel, aren't they ? As for asking for a specific pid from user space, it can be done by: * a new syscall (restricted to user-owned-namespace or CAP_SYS_ADMIN) * a sys_restart(... SET_NEXT_PID) interface specific for restart (ugh) * setting a special /proc/PID/next_id file which is consulted by fork and in all cases, limit this so it can only allowed in a restarting container, under the proper security model (again, e.g., Serge's suggestion). > > Pids aren't special, they are struct pid, dynamically allocated and > refcounted just like any other structtures. > > They _become_ special for you intended method of restart. They are special. And I allow them not to be restored, as well, if the use case so wishes. > > You also have flags in nsproxy image (or where?) like "do clone with > CLONE_NEWUTS". Nope. Read the code. 
> > This is unneeded! > > nsproxy (or task_struct) image have reference (objref/position) to uts_ns image. > > On restart, one lookups object by reference or restore it if needed, > takes refcount and glue. Just like with every other two structures. That's exactly how it's done. > > No "what to do, what to do" logic. > >> Either way we go, it should be fairly easy to switch from one method >> to the other, should we need to. >> >> All in all, there isn't a strong reason in favor of kernel method. >> >> In contrast, it's at least as simple in userspace (reusing existing >> syscalls). More importantly, the flexibility that we gain with restart >> of tasks in userspace, no cost incurred (in terms of implementation or >> runtime overhead). > > Regarding of who should orchestrate restart(2). > > Special process who calls restart(2) should do it. It doesn't relate to > restarted process at all. It isn't, for example, init of future container. > Could also be (new) container init. The parent of that process will figure out the status of the operation (success/fail) and report. If any of the actual restarting tasks crashed/segfaults - very well, the parent will hide that from user and report 'failure'. > Reasons: > 1) somebody should write registers before final jump to userspace. > Task itself can't generally do it: struct pt_regs is in the same place > as kernel stack. > > cr_load_cpu_regs() does exactly this: as current writes to it's own > pt_regs. Oren, why don't you see crashes? LOL :) Maybe because it works ? > > I first tried to do it and was greeted with horrible crashes because > e.g current becoming NULL under current. That's why > cr_arch_restore_task_struct() is not done in current context. > > 2) Somebody should restore who is parent of whom, who ptraces whom, > resolved all possible loops and so on. Intuition tells me that it > should be context which is not involved into restart(2) other than > doing this post-restart-before-thaw part. 
Who is the parent of whom - determined by the fork/clone order. No loops, because when restarts actually starts (that is, in the kernel), all tasks have already been created, so they are readily available. ptrace property - (will be) restored as part of the regular task restore for each task, in order. Within each task, it will be one of the last things the task does to avoid generating spurious ptrace events while restarting. > > Consequently, it should not be init of futire container. It's just > another task after all, from the POV of reparenting code. > Not at all necessary, as explained above. There is, however, a need for a parent task that will monitor the operation and report success/failure to user. Oren. ^ permalink raw reply [flat|nested] 21+ messages in thread
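The monitoring parent mentioned above can be sketched with fork() and waitpid(). This is a hedged illustration, not code from the patchset: `supervise_restart()` and the three demo task functions are hypothetical, and each demo function stands in for a task that would call sys_restart(). The parent collapses a non-zero exit or death by signal into a single failure result.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run one restarting task in a child process and report the outcome:
 * 0 if the child exited cleanly, -1 if it failed or crashed.
 */
static int supervise_restart(int (*task)(void))
{
	pid_t pid;
	int status;

	assert(task != NULL);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(task() == 0 ? 0 : 1);

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	/* A signal death (e.g. a segfault) is hidden behind plain failure. */
	if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
		return 0;
	return -1;
}

/* Demo stand-ins for a restarting task. */
static int task_ok(void)    { return 0; }
static int task_fail(void)  { return 1; }
static int task_crash(void) { raise(SIGKILL); return 0; }
```

The design choice here is that the caller sees only success or failure. Anything more detailed (which task failed, and why) would have to be logged by the supervisor itself.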
* Re: Creating tasks on restart: userspace vs kernel 2009-04-14 18:40 ` Oren Laadan @ 2009-04-14 19:59 ` Alexey Dobriyan 2009-04-14 20:10 ` Oren Laadan 2009-04-15 19:56 ` C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel) Alexey Dobriyan 1 sibling, 1 reply; 21+ messages in thread From: Alexey Dobriyan @ 2009-04-14 19:59 UTC (permalink / raw) To: Oren Laadan Cc: containers, Dave Hansen, Serge E. Hallyn, Andrew Morton, Linus Torvalds, Linux-Kernel, Ingo Molnar > > In the end correctness of chopping will be equal to how good user > > understands that two task_struct's are independent of each other. > > > >> But it will still be a useful tool for many use cases, like batch cpu jobs, > >> some servers, vnc sessions (if you want graphics) etc. Imagine you run > >> 'octave' for a week and must reboot now - 'octave' wouldn't care if > >> you checkpointed it and then restart with a different pid ! > >> > >> <3> Clone with pid: > >> > >> To restart processes from userspace, there needs to be a way to > >> request a specific pid--in the current pid_ns--for the child process > >> (clearly, if it isn't in use). > >> > >> Why is it a disadvantage ? to Linus, a syscall clone_with_pid() > >> "sounds like a _wonderful_ attack vector against badly written > >> user-land software...". Actually, getting a specific pid is possible > >> without this syscall. But the point is that it's undesirable to have > >> this functionality unrestricted. > >> > >> So one option is to require root privileges. Another option is to > >> restrict such action in pid_ns created by the same user. Even more so, > >> restrict to only containers that are being restarted. > > > > You want to do small part in userspace and consequently end up with hacks > > both userspace-visible and in-kernel. > > I want to extend existing kernel interface to leverage fork/clone > from user space, AND to allow the flexibility mentioned above (which > you conveniently ignored). 
>
> All hacks are in-kernel, aren't they ?

mktree.c can be viewed as a hack, why not?

The whole existence of these requirements. You want a new syscall or
SET_NEXT_PID or a /proc file or something.

> As for asking for a specific pid from user space, it can be done by:
> * a new syscall (restricted to user-owned-namespace or CAP_SYS_ADMIN)
> * a sys_restart(... SET_NEXT_PID) interface specific for restart (ugh)
> * setting a special /proc/PID/next_id file which is consulted by fork

/proc/*/next_id was discussed and hopefully died, but no.

> and in all cases, limit this so it can only be allowed in a restarting
> container, under the proper security model (again, e.g., Serge's
> suggestion).
>
> > Pids aren't special, they are struct pid, dynamically allocated and
> > refcounted just like any other structures.
> >
> > They _become_ special for your intended method of restart.
>
> They are special. And I allow them not to be restored, as well, if
> the use case so wishes.

The use case is to restore as much as possible, to a state as close to
the original as possible. Not going with fork_with_pid() in any form
helps the kernel to ensure correctness of restore and helps to avoid
surprise failure modes from the user's POV.

> > You also have flags in nsproxy image (or where?) like "do clone with
> > CLONE_NEWUTS".
>
> Nope. Read the code.

Which code?

    static int cr_write_namespaces(struct cr_ctx *ctx, struct task_struct *t)
    {
        ...

        new_uts = cr_obj_add_ptr(ctx, nsproxy->uts_ns,
                                 &hh->uts_ref, CR_OBJ_UTSNS, 0);
        if (new_uts < 0) {
            ret = new_uts;
            goto out;
        }

        hh->flags = 0;
        if (new_uts)
    ===>    hh->flags |= CLONE_NEWUTS;

        ret = cr_write_obj(ctx, &h, hh);
        ...

> > This is unneeded!
> >
> > nsproxy (or task_struct) image have reference (objref/position) to uts_ns image.
> >
> > On restart, one lookups object by reference or restore it if needed,
> > takes refcount and glue. Just like with every other two structures.
>
> That's exactly how it's done.

Not for uts_ns and future namespaces.
    ret = cr_restore_utsns(ctx, hh->uts_ref, hh->flags);
                                             ^^^^^^^^^
                                             comes from disk

> > No "what to do, what to do" logic.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Creating tasks on restart: userspace vs kernel 2009-04-14 19:59 ` Alexey Dobriyan @ 2009-04-14 20:10 ` Oren Laadan 2009-04-14 21:01 ` Alexey Dobriyan 0 siblings, 1 reply; 21+ messages in thread From: Oren Laadan @ 2009-04-14 20:10 UTC (permalink / raw) To: Alexey Dobriyan Cc: containers, Dave Hansen, Serge E. Hallyn, Andrew Morton, Linus Torvalds, Linux-Kernel, Ingo Molnar Alexey Dobriyan wrote: >>> In the end correctness of chopping will be equal to how good user >>> understands that two task_struct's are independent of each other. >>> >>>> But it will still be a useful tool for many use cases, like batch cpu jobs, >>>> some servers, vnc sessions (if you want graphics) etc. Imagine you run >>>> 'octave' for a week and must reboot now - 'octave' wouldn't care if >>>> you checkpointed it and then restart with a different pid ! >>>> >>>> <3> Clone with pid: >>>> >>>> To restart processes from userspace, there needs to be a way to >>>> request a specific pid--in the current pid_ns--for the child process >>>> (clearly, if it isn't in use). >>>> >>>> Why is it a disadvantage ? to Linus, a syscall clone_with_pid() >>>> "sounds like a _wonderful_ attack vector against badly written >>>> user-land software...". Actually, getting a specific pid is possible >>>> without this syscall. But the point is that it's undesirable to have >>>> this functionality unrestricted. >>>> >>>> So one option is to require root privileges. Another option is to >>>> restrict such action in pid_ns created by the same user. Even more so, >>>> restrict to only containers that are being restarted. >>> You want to do small part in userspace and consequently end up with hacks >>> both userspace-visible and in-kernel. >> I want to extend existing kernel interface to leverage fork/clone >> from user space, AND to allow the flexibility mentioned above (which >> you conveniently ignored). >> >> All hacks are in-kernel, aren't they ? > > mktree.c can be vieved as hack, why not? Lol .. 
I meant "all kernel hacks are in-kernel" :) > > The whole existence of these requirements. You want new syscall or SET_NEX_PID > or /proc file or something. Or embed it into a restart(2) call with special argument. > >> As for asking for a specific pid from user space, it can be done by: >> * a new syscall (restricted to user-owned-namespace or CAP_SYS_ADMIN) >> * a sys_restart(... SET_NEXT_PID) interface specific for restart (ugh) >> * setting a special /proc/PID/next_id file which is consulted by fork > > /proc/*/next_id was disscussed and hopefully died, but no. > >> and in all cases, limit this so it can only allowed in a restarting >> container, under the proper security model (again, e.g., Serge's >> suggestion). >> >>> Pids aren't special, they are struct pid, dynamically allocated and >>> refcounted just like any other structtures. >>> >>> They _become_ special for you intended method of restart. >> They are special. And I allow them not to be restored, as well, if >> the use case so wishes. > > The use case is to restore as much as possible to the same state as > equal as possible. Not going with fork_with_pid() in any form helps > kernel to ensure correctness of restore and helps to avoid surprise > failure modes from user POV. > >>> You also have flags in nsproxy image (or where?) like "do clone with >>> CLONE_NEWUTS". >> Nope. Read the code. > > Which code? > > static int cr_write_namespaces(struct cr_ctx *ctx, struct task_struct *t) > { > ... > > new_uts = cr_obj_add_ptr(ctx, nsproxy->uts_ns, > &hh->uts_ref, CR_OBJ_UTSNS, 0); > if (new_uts < 0) { > ret = new_uts; > goto out; > } > > hh->flags = 0; > if (new_uts) > ===> hh->flags |= CLONE_NEWUTS; > > ret = cr_write_obj(ctx, &h, hh); > ... > >>> This is unneeded! >>> >>> nsproxy (or task_struct) image have reference (objref/position) to uts_ns image. >>> >>> On restart, one lookups object by reference or restore it if needed, >>> takes refcount and glue. Just like with every other two structures. 
>> That's exactly how it's done.
>
> Not for uts_ns and future namespaces.
>
>     ret = cr_restore_utsns(ctx, hh->uts_ref, hh->flags);
>                                              ^^^^^^^^^
>                                              comes from disk

Where else would it come from ? that's part of the state saved during
checkpoint.

That's for nested UTS namespaces, where a task in container called
unshare().

Oren.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Creating tasks on restart: userspace vs kernel 2009-04-14 20:10 ` Oren Laadan @ 2009-04-14 21:01 ` Alexey Dobriyan 0 siblings, 0 replies; 21+ messages in thread From: Alexey Dobriyan @ 2009-04-14 21:01 UTC (permalink / raw) To: Oren Laadan Cc: containers, Dave Hansen, Serge E. Hallyn, Andrew Morton, Linus Torvalds, Linux-Kernel, Ingo Molnar On Tue, Apr 14, 2009 at 04:10:53PM -0400, Oren Laadan wrote: > > > Alexey Dobriyan wrote: > >>> In the end correctness of chopping will be equal to how good user > >>> understands that two task_struct's are independent of each other. > >>> > >>>> But it will still be a useful tool for many use cases, like batch cpu jobs, > >>>> some servers, vnc sessions (if you want graphics) etc. Imagine you run > >>>> 'octave' for a week and must reboot now - 'octave' wouldn't care if > >>>> you checkpointed it and then restart with a different pid ! > >>>> > >>>> <3> Clone with pid: > >>>> > >>>> To restart processes from userspace, there needs to be a way to > >>>> request a specific pid--in the current pid_ns--for the child process > >>>> (clearly, if it isn't in use). > >>>> > >>>> Why is it a disadvantage ? to Linus, a syscall clone_with_pid() > >>>> "sounds like a _wonderful_ attack vector against badly written > >>>> user-land software...". Actually, getting a specific pid is possible > >>>> without this syscall. But the point is that it's undesirable to have > >>>> this functionality unrestricted. > >>>> > >>>> So one option is to require root privileges. Another option is to > >>>> restrict such action in pid_ns created by the same user. Even more so, > >>>> restrict to only containers that are being restarted. > >>> You want to do small part in userspace and consequently end up with hacks > >>> both userspace-visible and in-kernel. > >> I want to extend existing kernel interface to leverage fork/clone > >> from user space, AND to allow the flexibility mentioned above (which > >> you conveniently ignored). 
> >> > >> All hacks are in-kernel, aren't they ? > > > > mktree.c can be vieved as hack, why not? > > Lol .. I meant "all kernel hacks are in-kernel" :) > > > > > The whole existence of these requirements. You want new syscall or SET_NEX_PID > > or /proc file or something. > > Or embed it into a restart(2) call with special argument. > > > > >> As for asking for a specific pid from user space, it can be done by: > >> * a new syscall (restricted to user-owned-namespace or CAP_SYS_ADMIN) > >> * a sys_restart(... SET_NEXT_PID) interface specific for restart (ugh) > >> * setting a special /proc/PID/next_id file which is consulted by fork > > > > /proc/*/next_id was disscussed and hopefully died, but no. > > > >> and in all cases, limit this so it can only allowed in a restarting > >> container, under the proper security model (again, e.g., Serge's > >> suggestion). > >> > >>> Pids aren't special, they are struct pid, dynamically allocated and > >>> refcounted just like any other structtures. > >>> > >>> They _become_ special for you intended method of restart. > >> They are special. And I allow them not to be restored, as well, if > >> the use case so wishes. > > > > The use case is to restore as much as possible to the same state as > > equal as possible. Not going with fork_with_pid() in any form helps > > kernel to ensure correctness of restore and helps to avoid surprise > > failure modes from user POV. > > > >>> You also have flags in nsproxy image (or where?) like "do clone with > >>> CLONE_NEWUTS". > >> Nope. Read the code. > > > > Which code? > > > > static int cr_write_namespaces(struct cr_ctx *ctx, struct task_struct *t) > > { > > ... > > > > new_uts = cr_obj_add_ptr(ctx, nsproxy->uts_ns, > > &hh->uts_ref, CR_OBJ_UTSNS, 0); > > if (new_uts < 0) { > > ret = new_uts; > > goto out; > > } > > > > hh->flags = 0; > > if (new_uts) > > ===> hh->flags |= CLONE_NEWUTS; > > > > ret = cr_write_obj(ctx, &h, hh); > > ... > > > >>> This is unneeded! 
> >>>
> >>> nsproxy (or task_struct) image have reference (objref/position) to uts_ns image.
> >>>
> >>> On restart, one lookups object by reference or restore it if needed,
> >>> takes refcount and glue. Just like with every other two structures.
> >> That's exactly how it's done.
> >
> > Not for uts_ns and future namespaces.
> >
> >     ret = cr_restore_utsns(ctx, hh->uts_ref, hh->flags);
> >                                              ^^^^^^^^^
> >                                              comes from disk
>
> Where else would it come from ? that's part of the state saved during
> checkpoint.

This is a bogus part saved during checkpoint.

To restore nsproxy you only need references to uts_ns, ipc_ns, mnt_ns,
pid_ns and net_ns, no flags.

You can try to be smart and, consequently, end up with checks where
a) flags tell to unshare uts_ns, but the reference is the same, and
b) flags don't tell to unshare, but the reference is different.

This is unneeded code coming from the way you restore nsproxy incorrectly.

> That's for nested UTS namespaces,

Just to clear up terminology, UTS namespaces aren't nested, only PID
namespaces are.

> where a task in container called unshare().

^ permalink raw reply [flat|nested] 21+ messages in thread
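The lookup-or-restore scheme argued for in the message above can be sketched in plain userspace C. This is a minimal sketch under stated assumptions: `struct ns_obj`, `struct obj_table`, `obj_get_or_restore()` and the fixed-size table are hypothetical stand-ins, not the cr_* functions from the patchset. The point it illustrates is that sharing falls out of two tasks carrying the same reference, so no CLONE_* flag has to be read back from the image.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_OBJS 64

/* Hypothetical restored object, standing in for e.g. a uts namespace. */
struct ns_obj {
    int objref;    /* reference number found in the image */
    int refcount;  /* how many restored tasks point at it */
};

/* Hypothetical per-restart table mapping objref -> restored object. */
struct obj_table {
    struct ns_obj *slot[MAX_OBJS];
    int restored;  /* how many objects were actually created */
};

/*
 * Return the object for `objref`, creating it on first use.
 * Every successful call takes one reference.
 * Returns NULL on an out-of-range reference or allocation failure.
 */
struct ns_obj *obj_get_or_restore(struct obj_table *tbl, int objref)
{
    assert(tbl != NULL);
    if (objref < 0 || objref >= MAX_OBJS)
        return NULL;
    if (!tbl->slot[objref]) {
        struct ns_obj *o = calloc(1, sizeof(*o));
        if (!o)
            return NULL;
        o->objref = objref;
        tbl->slot[objref] = o;      /* first task to reference it restores it */
        tbl->restored++;
    }
    tbl->slot[objref]->refcount++;  /* later tasks only take a reference */
    return tbl->slot[objref];
}

/* Release every object held by the table. */
void obj_table_free(struct obj_table *tbl)
{
    assert(tbl != NULL);
    for (int i = 0; i < MAX_OBJS; i++) {
        free(tbl->slot[i]);
        tbl->slot[i] = NULL;
    }
    tbl->restored = 0;
}
```

With this shape, the two mismatch cases listed in the message (flag says unshare but the reference is the same, or the reverse) cannot arise, because there is only one source of truth.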
* C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel)
  2009-04-14 18:40 ` Oren Laadan
  2009-04-14 19:59   ` Alexey Dobriyan
@ 2009-04-15 19:56   ` Alexey Dobriyan
  2009-04-15 21:38     ` C/R without "leaks" Oren Laadan
  2009-04-15 22:42     ` C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel) Greg Kurz
  1 sibling, 2 replies; 21+ messages in thread
From: Alexey Dobriyan @ 2009-04-15 19:56 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers, Dave Hansen, Serge E. Hallyn, Andrew Morton,
      Linus Torvalds, Linux-Kernel, Ingo Molnar

> Again, so to checkpoint one task in the topmost pid-ns you need to
> checkpoint (if at all possible) the entire system ?!

One more argument to not allow "leaks" and checkpoint whole container,
no ifs, buts and woulditbenices.

Just to clarify, C/R with a "leak" is, for example, when a process has a
separate pidns, but shares, for example, netns with another process not
involved in the checkpoint.

If you allow this, you lose one important property of the checkpoint
part, namely, almost everything is frozen. Losing this property means
suddenly much more stuff is alive during dump and you have to account
for more stuff when checkpointing. You are effectively checkpointing
live data structures and there is no guarantee you'll get it right.

Example 1: utsns is shared with the rest of the world.

utsns content is modifiable only by tasks (current->nsproxy->uts_ns).
Consequently, someone can modify utsns content while you're dumping it
if you allow "leaks".

Did you take precautions? Where?
    static int cr_write_utsns(struct cr_ctx *ctx, struct uts_namespace *uts_ns)
    {
        struct cr_hdr h;
        struct cr_hdr_utsns *hh;
        int domainname_len;
        int nodename_len;
        int ret;

        h.type = CR_HDR_UTSNS;
        h.len = sizeof(*hh);

        hh = cr_hbuf_get(ctx, sizeof(*hh));
        if (!hh)
            return -ENOMEM;

        nodename_len = strlen(uts_ns->name.nodename) + 1;
        domainname_len = strlen(uts_ns->name.domainname) + 1;

        hh->nodename_len = nodename_len;
        hh->domainname_len = domainname_len;

        ret = cr_write_obj(ctx, &h, hh);
        cr_hbuf_put(ctx, sizeof(*hh));
        if (ret < 0)
            return ret;

        ret = cr_write_string(ctx, uts_ns->name.nodename, nodename_len);
        if (ret < 0)
            return ret;

        ret = cr_write_string(ctx, uts_ns->name.domainname, domainname_len);
        return ret;
    }

You should take uts_sem.


Example 2: ipcns is shared with the rest of the world

Consequently, shm segment is visible outside and live. Someone already
shmatted to it. What will end up in shm segment content? Anything.

You should check struct file refcount or something and disable attaching
while dumping or something.


Moral: Every time you do dump on something live you get complications.
Every single time.


There are sockets and live netns as the most complex example. I'm not
prepared to describe it exactly, but people wishing to do C/R with
"leaks" should be very careful with their wishes.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: C/R without "leaks" 2009-04-15 19:56 ` C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel) Alexey Dobriyan @ 2009-04-15 21:38 ` Oren Laadan 2009-04-22 0:16 ` Nathan Lynch 2009-04-15 22:42 ` C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel) Greg Kurz 1 sibling, 1 reply; 21+ messages in thread From: Oren Laadan @ 2009-04-15 21:38 UTC (permalink / raw) To: Alexey Dobriyan Cc: containers, Dave Hansen, Serge E. Hallyn, Andrew Morton, Linus Torvalds, Linux-Kernel, Ingo Molnar Alexey Dobriyan wrote: >> Again, so to checkpoint one task in the topmost pid-ns you need to >> checkpoint (if at all possible) the entire system ?! > > One more argument to not allow "leaks" and checkpoint whole container, > no ifs, buts and woulditbenices. > > Just to clarify, C/R with "leak" is for example when process has separate > pidns, but shares, for example, netns with other process not involved in > checkpoint. > > If you allow this, you lose one important property of checkpoint part, > namely, almost everything is frozen. Losing this property means suddenly > much more stuff is alive during dump and you has to account to more stuff > when checkpointing. You effectively checkpointing on live data structures > and there is no guarantee you'll get it right. Alexey, we're entirely on par about this: everyone agrees that if you want the maximal guarantee (if one exists) you must checkpoint entire container and have no leaks. The point I'm stressing is that there are other use cases, and other users, that can do great things even without full container. And my goal is to provide them this capability. Specially since the mechanism is shared by both cases. > > Example 1: utsns is shared with the rest of the world. > > utsns content is modifiable only by tasks (current->nsproxy->uts_ns). > Consequently, someone can modify utsns content while you're dumping it > if you allow "leaks". > > Did you take precautions? Where? 
>
>     static int cr_write_utsns(struct cr_ctx *ctx, struct uts_namespace *uts_ns)
>     {
>         struct cr_hdr h;
>         struct cr_hdr_utsns *hh;
>         int domainname_len;
>         int nodename_len;
>         int ret;
>
>         h.type = CR_HDR_UTSNS;
>         h.len = sizeof(*hh);
>
>         hh = cr_hbuf_get(ctx, sizeof(*hh));
>         if (!hh)
>             return -ENOMEM;
>
>         nodename_len = strlen(uts_ns->name.nodename) + 1;
>         domainname_len = strlen(uts_ns->name.domainname) + 1;
>
>         hh->nodename_len = nodename_len;
>         hh->domainname_len = domainname_len;
>
>         ret = cr_write_obj(ctx, &h, hh);
>         cr_hbuf_put(ctx, sizeof(*hh));
>         if (ret < 0)
>             return ret;
>
>         ret = cr_write_string(ctx, uts_ns->name.nodename, nodename_len);
>         if (ret < 0)
>             return ret;
>
>         ret = cr_write_string(ctx, uts_ns->name.domainname, domainname_len);
>         return ret;
>     }
>
> You should take uts_sem.

Fair enough. Will fix :)

However, even with a leak count you need the uts_sem, because if it is
shared by another task when you start the checkpoint, but not shared by
the time you do the leak check - then you missed it.

And then, even the semaphore won't work unless you keep it for the
entire duration of the checkpoint: if tasks A and B inside the container
both know something about the UTS contents, and task C outside modified
it before the checkpoint was taken, then, at least potentially, we have
an inconsistency that neither you nor I detect.

The best part, however, is that it is unlikely that either A or B would
ever *care* about that, especially in the case of UTS.

And that brings me to the moral: in so many cases the user will live
happily ever after even if the UTS changes 50 times during the
checkpoint. Because her tasks don't care about it.

Remember that "flexibility" argument in my first post to this thread:
the next step is that the user can say "cradvise(UTS, I_DONT_CARE)":
during checkpoint the kernel won't save it, during restart the kernel
won't restore it.
Voila, so little effort to make people happy :) > > > Example 2: ipcns is shared with the rest of the world > > Consequently, shm segment is visible outside and live. Someone already > shmatted to it. What will end up in shm segment content? Anything. This is another excellent example. You are _so_ right that it doesn't make much sense to try to restart a program that relies on something that isn't part of the checkpoint. And yet, there are a handful programs, applications, processes that do not depend on the outside world in any important way, tasks that frankly, my dear, don't give a ... > > You should check struct file refcount or something and disable attaching > while dumping or something. Yes, yes, yes ! But -- when you focus solely on the full-container-only case. Deciding what's best for the users is a two-edged-sword. It works well to achieve foolproof operation with the less knowledgeable, but it's a bit of an arrogant approach for the more sophisticated ones. If you limit c/r to a full-container-only, you take away a freedom from the users - you take away a huge opportunity to use the c/r to its full potential. And you have this extra functionality for nearly free ! It's like giving the user a full blown linux laptop but disallowing use of the command line :p > > Moral: Every time you do dump on something live you get complications. > Every single time. "while(1);" will never have complications... :) And seriously, yes, you can bring endless examples of when it won't work. And others will bring their examples of when it will be ok even with "complications", because if you don't care about certain stuff, the "complication" becomes void. We can always restrict c/r later, either by code, or privileges, or system config, sysadmin policy, flag to checkpoint(2), you name it. So those who seek general case guarantee are happy. Why do it a-priori and block all other users ? is it of everyone's best interest to decide now that no-one should ever do so ? Oren. 
> > > There are sockets and live netns as the most complex example. I'm not > prepared to describe it exactly, but people wishing to do C/R with > "leaks" should be very careful with their wishes. > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: C/R without "leaks" 2009-04-15 21:38 ` C/R without "leaks" Oren Laadan @ 2009-04-22 0:16 ` Nathan Lynch 0 siblings, 0 replies; 21+ messages in thread From: Nathan Lynch @ 2009-04-22 0:16 UTC (permalink / raw) To: Oren Laadan Cc: Alexey Dobriyan, Linux-Kernel, Dave Hansen, containers, Andrew Morton, Linus Torvalds, Ingo Molnar Oren Laadan <orenl@cs.columbia.edu> writes: > Alexey Dobriyan wrote: >>> Again, so to checkpoint one task in the topmost pid-ns you need to >>> checkpoint (if at all possible) the entire system ?! >> >> One more argument to not allow "leaks" and checkpoint whole container, >> no ifs, buts and woulditbenices. >> >> Just to clarify, C/R with "leak" is for example when process has separate >> pidns, but shares, for example, netns with other process not involved in >> checkpoint. >> >> If you allow this, you lose one important property of checkpoint part, >> namely, almost everything is frozen. Losing this property means suddenly >> much more stuff is alive during dump and you has to account to more stuff >> when checkpointing. You effectively checkpointing on live data structures >> and there is no guarantee you'll get it right. > > Alexey, we're entirely on par about this: everyone agrees that if you > want the maximal guarantee (if one exists) you must checkpoint entire > container and have no leaks. > > The point I'm stressing is that there are other use cases, and other > users, that can do great things even without full container. And my > goal is to provide them this capability. As it seems that Alexey's goal is more or less a subset of yours, would it make sense in the near term to concentrate on getting an implementation upstream that satisfies that subset (i.e. checkpoint on a container basis only)? And then support for checkpointing arbitrary processes could be added later, if it proves feasible? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel) 2009-04-15 19:56 ` C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel) Alexey Dobriyan 2009-04-15 21:38 ` C/R without "leaks" Oren Laadan @ 2009-04-15 22:42 ` Greg Kurz 2009-04-16 16:12 ` Alexey Dobriyan 1 sibling, 1 reply; 21+ messages in thread From: Greg Kurz @ 2009-04-15 22:42 UTC (permalink / raw) To: Alexey Dobriyan Cc: Oren Laadan, Linux-Kernel, Dave Hansen, containers, Andrew Morton, Linus Torvalds, Ingo Molnar On Wed, 2009-04-15 at 23:56 +0400, Alexey Dobriyan wrote: > > Again, so to checkpoint one task in the topmost pid-ns you need to > > checkpoint (if at all possible) the entire system ?! > > One more argument to not allow "leaks" and checkpoint whole container, > no ifs, buts and woulditbenices. > > Just to clarify, C/R with "leak" is for example when process has separate > pidns, but shares, for example, netns with other process not involved in > checkpoint. > > If you allow this, you lose one important property of checkpoint part, > namely, almost everything is frozen. Losing this property means suddenly > much more stuff is alive during dump and you has to account to more stuff > when checkpointing. You effectively checkpointing on live data structures > and there is no guarantee you'll get it right. > > Example 1: utsns is shared with the rest of the world. > > utsns content is modifiable only by tasks (current->nsproxy->uts_ns). > Consequently, someone can modify utsns content while you're dumping it > if you allow "leaks". > > Did you take precautions? Where? 
> > static int cr_write_utsns(struct cr_ctx *ctx, struct uts_namespace *uts_ns) > { > struct cr_hdr h; > struct cr_hdr_utsns *hh; > int domainname_len; > int nodename_len; > int ret; > > h.type = CR_HDR_UTSNS; > h.len = sizeof(*hh); > > hh = cr_hbuf_get(ctx, sizeof(*hh)); > if (!hh) > return -ENOMEM; > > nodename_len = strlen(uts_ns->name.nodename) + 1; > domainname_len = strlen(uts_ns->name.domainname) + 1; > > hh->nodename_len = nodename_len; > hh->domainname_len = domainname_len; > > ret = cr_write_obj(ctx, &h, hh); > cr_hbuf_put(ctx, sizeof(*hh)); > if (ret < 0) > return ret; > > ret = cr_write_string(ctx, uts_ns->name.nodename, nodename_len); > if (ret < 0) > return ret; > > ret = cr_write_string(ctx, uts_ns->name.domainname, domainname_len); > return ret; > } > > You should take uts_sem. > > > Example 2: ipcns is shared with the rest of the world > > Consequently, shm segment is visible outside and live. Someone already > shmatted to it. What will end up in shm segment content? Anything. > > You should check struct file refcount or something and disable attaching > while dumping or something. > > > Moral: Every time you do dump on something live you get complications. > Every single time. > > > There are sockets and live netns as the most complex example. I'm not > prepared to describe it exactly, but people wishing to do C/R with > "leaks" should be very careful with their wishes. They should close their sockets before checkpoint and find/have some way to reconnect after. This implies some kind of C/R awareness in the code to be checkpointed. -- Gregory Kurz gkurz@fr.ibm.com Software Engineer @ IBM/Meiosys http://www.ibm.com Tel +33 (0)534 638 479 Fax +33 (0)561 400 420 "Anarchy is about taking complete responsibility for yourself." Alan Moore. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel) 2009-04-15 22:42 ` C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel) Greg Kurz @ 2009-04-16 16:12 ` Alexey Dobriyan 2009-04-16 18:10 ` C/R without "leaks" Chris Friesen 2009-04-17 8:46 ` C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel) Greg Kurz 0 siblings, 2 replies; 21+ messages in thread From: Alexey Dobriyan @ 2009-04-16 16:12 UTC (permalink / raw) To: Greg Kurz Cc: Oren Laadan, Linux-Kernel, Dave Hansen, containers, Andrew Morton, Linus Torvalds, Ingo Molnar On Thu, Apr 16, 2009 at 12:42:17AM +0200, Greg Kurz wrote: > On Wed, 2009-04-15 at 23:56 +0400, Alexey Dobriyan wrote: > > There are sockets and live netns as the most complex example. I'm not > > prepared to describe it exactly, but people wishing to do C/R with > > "leaks" should be very careful with their wishes. > > They should close their sockets before checkpoint and find/have some way > to reconnect after. This implies some kind of C/R awareness in the code > to be checkpointed. How do you imagine sshd closing sockets and reconnecting? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: C/R without "leaks" 2009-04-16 16:12 ` Alexey Dobriyan @ 2009-04-16 18:10 ` Chris Friesen 2009-04-16 18:39 ` Oren Laadan 2009-04-17 8:46 ` C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel) Greg Kurz 1 sibling, 1 reply; 21+ messages in thread From: Chris Friesen @ 2009-04-16 18:10 UTC (permalink / raw) To: Alexey Dobriyan Cc: Greg Kurz, Oren Laadan, Linux-Kernel, Dave Hansen, containers, Andrew Morton, Linus Torvalds, Ingo Molnar Alexey Dobriyan wrote: > On Thu, Apr 16, 2009 at 12:42:17AM +0200, Greg Kurz wrote: >> On Wed, 2009-04-15 at 23:56 +0400, Alexey Dobriyan wrote: > >>> There are sockets and live netns as the most complex example. I'm not >>> prepared to describe it exactly, but people wishing to do C/R with >>> "leaks" should be very careful with their wishes. >> They should close their sockets before checkpoint and find/have some way >> to reconnect after. This implies some kind of C/R awareness in the code >> to be checkpointed. > > How do you imagine sshd closing sockets and reconnecting? Don't you already have to handle the case where an sshd connection is checkpointed, then the system is shutdown and the restore doesn't happen until after the TCP timeout? Chris ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: C/R without "leaks"
  2009-04-16 18:10 ` C/R without "leaks" Chris Friesen
@ 2009-04-16 18:39   ` Oren Laadan
  2009-04-17  9:15     ` Greg Kurz
  0 siblings, 1 reply; 21+ messages in thread
From: Oren Laadan @ 2009-04-16 18:39 UTC (permalink / raw)
  To: Chris Friesen
  Cc: Alexey Dobriyan, Greg Kurz, Linux-Kernel, Dave Hansen, containers,
      Andrew Morton, Linus Torvalds, Ingo Molnar

Chris Friesen wrote:
> Alexey Dobriyan wrote:
>> On Thu, Apr 16, 2009 at 12:42:17AM +0200, Greg Kurz wrote:
>>> On Wed, 2009-04-15 at 23:56 +0400, Alexey Dobriyan wrote:
>>
>>>> There are sockets and live netns as the most complex example. I'm not
>>>> prepared to describe it exactly, but people wishing to do C/R with
>>>> "leaks" should be very careful with their wishes.
>>> They should close their sockets before checkpoint and find/have some way
>>> to reconnect after. This implies some kind of C/R awareness in the code
>>> to be checkpointed.
>>
>> How do you imagine sshd closing sockets and reconnecting?
>
> Don't you already have to handle the case where an sshd connection is
> checkpointed, then the system is shutdown and the restore doesn't happen
> until after the TCP timeout?

Any connection in that case is, of course, lost, and it's up to the
application to do something about it. If the application relies on
the state of the connection, it will have to give up (e.g. sshd, and
ssh, die).

However, there are many applications that can withstand a lost
connection without crashing. They simply retry (web browser, irc
client, db clients). With time, there may be more applications that
are 'c/r-aware'.

Moreover, in some cases you could, on restart, use a wrapper to
create a new connection to somewhere (*), then ask restart(2) to
use that socket instead of the original, such that from the user's
point of view things continue to work well, transparently.
(*) that somewhere could be the original peer, or another server,
if it has a way to somehow continue a cut connection, or a special
wrapper server that you write for that purpose.

Oren.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: C/R without "leaks"
From: Greg Kurz @ 2009-04-17 9:15 UTC
To: Oren Laadan
Cc: Chris Friesen, Alexey Dobriyan, Linux-Kernel, Dave Hansen, containers, Andrew Morton, Linus Torvalds, Ingo Molnar

On Thu, 2009-04-16 at 14:39 -0400, Oren Laadan wrote:
> Any connection in that case is, of course, lost, and it's up to the
> application to do something about it. If the application relies on
> the state of the connection, it will have to give up (e.g. sshd, and
> ssh, die).
>

And that's a good thing since that's exactly what users expect from
sshd: to give up the connection when something goes wrong. I wouldn't
trust an sshd with the ability to initiate connections on its own...

And anyway, I still don't see the scenario where C/R of an sshd is useful...

Please someone (Alexey?), provide a detailed use case where people
would want to checkpoint or migrate live TCP connections... Discussion
on containers@ is very interesting but really lacks
what-is-the-bigger-picture arguments... These huge patchsets are very
tricky and intrusive... Who wants them in mainline? What's the use of
C/R?

> However, there are many applications that can withstand connection
> loss without crashing. They simply retry (web browser, irc client,
> db clients). With time, there may be more applications that are
> 'c/r-aware'.
>

HPC jobs are definitely good candidates.

> Moreover, in some cases you could, on restart, use a wrapper to
> create a new connection to somewhere (*), then ask restart(2) to
> use that socket instead of the original, such that from the user
> point of view things continue to work well, transparently.
>

Yes.

> (*) that somewhere, could be the original peer, or another server,
> if it has a way to somehow continue a cut connection, or a special
> wrapper server that you write for that purpose.
>
> Oren.
>

--
Gregory Kurz gkurz@fr.ibm.com
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.
* Re: C/R without "leaks"
From: Oren Laadan @ 2009-04-17 9:48 UTC
To: Greg Kurz
Cc: Chris Friesen, Alexey Dobriyan, Linux-Kernel, Dave Hansen, containers, Andrew Morton, Linus Torvalds, Ingo Molnar

Greg Kurz wrote:
> On Thu, 2009-04-16 at 14:39 -0400, Oren Laadan wrote:
>> Any connection in that case is, of course, lost, and it's up to the
>> application to do something about it. If the application relies on
>> the state of the connection, it will have to give up (e.g. sshd, and
>> ssh, die).
>>
>
> And that's a good thing since that's exactly what users expect from
> sshd: to give up the connection when something goes wrong. I wouldn't
> trust an sshd with the ability to initiate connections on its own...
>
> And anyway, I still don't see the scenario where C/R of an sshd is useful...

You mean an sshd with an open connection probably; the server itself
is clearly useful to be able to c/r.

> Please someone (Alexey?), provide a detailed use case where people
> would want to checkpoint or migrate live TCP connections... Discussion
> on containers@ is very interesting but really lacks
> what-is-the-bigger-picture arguments... These huge patchsets are very
> tricky and intrusive... Who wants them in mainline? What's the use of
> C/R?
>

A canonical example would be a virtual-private-server: instead of doing
server consolidation with a virtual machine, you do it with containers.
In a sense, containers let you chop the OS into independent isolated
pieces. You can use a linux box to run multiple virtual execution
environments (containers), each running services of your choice. They
could range from an sshd for users, to apache servers, to database
servers to users' vnc sessions, etc.

Now comes the time when you really need to take the machine down, for
whatever reason. With c/r of live connections you can live-migrate
these containers to another machine (on the same subnet) that will
"steal" the IP as well, and voila - no service disruption.

Such scenarios are the focus of Alexey.

I'm also very interested in these scenarios, and I'm _also_ thinking
of other scenarios, where either (a) an entire container is not
necessary (example: user running long computation on laptop and wants
to save it before a reboot), or (b) the program would like to make
adjustments to its state compared to the time it was saved (example:
change the location of an output log file depending on the machine
on which you are running).

Unfortunately, if we plan for and require, as per Alexey, that c/r
would only work for whole-containers, these two cases will not be
addressed.

Oren.

>> However, there are many applications that can withstand connection
>> loss without crashing. They simply retry (web browser, irc client,
>> db clients). With time, there may be more applications that are
>> 'c/r-aware'.
>>
>
> HPC jobs are definitely good candidates.
>
>> Moreover, in some cases you could, on restart, use a wrapper to
>> create a new connection to somewhere (*), then ask restart(2) to
>> use that socket instead of the original, such that from the user
>> point of view things continue to work well, transparently.
>>
>
> Yes.
>
>> (*) that somewhere, could be the original peer, or another server,
>> if it has a way to somehow continue a cut connection, or a special
>> wrapper server that you write for that purpose.
>>
>> Oren.
>>
* Re: C/R without "leaks"
From: Greg Kurz @ 2009-04-17 12:25 UTC
To: Oren Laadan
Cc: Chris Friesen, Alexey Dobriyan, Linux-Kernel, Dave Hansen, containers, Andrew Morton, Linus Torvalds, Ingo Molnar

On Fri, 2009-04-17 at 05:48 -0400, Oren Laadan wrote:
> You mean an sshd with an open connection probably; the server itself
> is clearly useful to be able to c/r.
>

Yes, I mean C/R of sshd with active connections.

>
> A canonical example would be a virtual-private-server: instead of doing
> server consolidation with a virtual machine, you do it with containers.
> In a sense, containers let you chop the OS into independent isolated
> pieces. You can use a linux box to run multiple virtual execution
> environments (containers), each running services of your choice. They
> could range from an sshd for users, to apache servers, to database
> servers to users' vnc sessions, etc.
>

Indeed, containers allow implementing a VPS just like virtual machines
do: we call them system containers. Not much to say about that since
they don't introduce new concepts to users.

> Now comes the time when you really need to take the machine down, for
> whatever reason. With c/r of live connections you can live-migrate
> these containers to another machine (on the same subnet) that will
> "steal" the IP as well, and voila - no service disruption.
>

Theoretically, yes. Practically, you need a lot more than *simply*
capturing and restoring socket states for such a migration to be usable
in the real world.

>
> Such scenarios are the focus of Alexey.
>

So Alexey should provide some realistic examples, with several hosts,
routers, switches and overall network infrastructure.

> I'm also very interested in these scenarios, and I'm _also_ thinking
> of other scenarios, where either (a) an entire container is not
> necessary (example: user running long computation on laptop and wants
> to save it before a reboot), or (b) the program would like to make
> adjustments to its state compared to the time it was saved (example:
> change the location of an output log file depending on the machine
> on which you are running).
>

I'm _only_ interested in these other scenarios for the moment.

> Unfortunately, if we plan for and require, as per Alexey, that c/r
> would only work for whole-containers, these two cases will not be
> addressed.
>

Discussion must go on then. There's no hurry in getting C/R mainlined. :)

--
Gregory Kurz gkurz@fr.ibm.com
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.
* Re: C/R without "leaks" (was: Re: Creating tasks on restart: userspace vs kernel)
From: Greg Kurz @ 2009-04-17 8:46 UTC
To: Alexey Dobriyan
Cc: Oren Laadan, Linux-Kernel, Dave Hansen, containers, Andrew Morton, Linus Torvalds, Ingo Molnar

On Thu, 2009-04-16 at 20:12 +0400, Alexey Dobriyan wrote:
> On Thu, Apr 16, 2009 at 12:42:17AM +0200, Greg Kurz wrote:
> > On Wed, 2009-04-15 at 23:56 +0400, Alexey Dobriyan wrote:
> >
> > > There are sockets and live netns as the most complex example. I'm not
> > > prepared to describe it exactly, but people wishing to do C/R with
> > > "leaks" should be very careful with their wishes.
> >
> > They should close their sockets before checkpoint and find/have some way
> > to reconnect after. This implies some kind of C/R awareness in the code
> > to be checkpointed.
>
> How do you imagine sshd closing sockets and reconnecting?

Dunno, and it isn't really my concern... I'm interested in HPC jobs that
can collaborate with the C/R feature. For example, jobs that use
interconnect hardware that will never be *checkpointable*... Usually,
the batch manager tells the job it's going to be checkpointed, so that
it can disconnect/shrink memory/reach a quiescent point, and reconnect
after resuming execution.

I understand you aim at supporting transparent C/R of connected TCP
sockets. Nice feature. Could you give use cases where it's *really*
helpful/needed/mandatory?

--
Gregory Kurz gkurz@fr.ibm.com
Software Engineer @ IBM/Meiosys http://www.ibm.com
Tel +33 (0)534 638 479 Fax +33 (0)561 400 420

"Anarchy is about taking complete responsibility for yourself."
Alan Moore.