* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found] <Pine.LNX.4.64.1011021530470.12128@takamine.ncl.cs.columbia.edu>
@ 2010-11-02 21:35 ` Tejun Heo
  2010-11-02 21:47   ` Christoph Hellwig
                     ` (2 more replies)
  2010-11-08 16:55 ` Grant Likely
  1 sibling, 3 replies; 111+ messages in thread
From: Tejun Heo @ 2010-11-02 21:35 UTC (permalink / raw)
  To: Oren Laadan; +Cc: ksummit-2010-discuss, linux-kernel

(cc'ing lkml too)

Hello,

On 11/02/2010 08:30 PM, Oren Laadan wrote:
> Following the discussion yesterday, here is a linux-cr diff that
> is limited to changes to existing code.
>
> The diff doesn't include the eclone() patches. I also tried to strip
> off the new c/r code (either code in new files, or new code within
> #ifdef CONFIG_CHECKPOINT in existing files).
>
> I left a few such snippets in, e.g. c/r syscalls templates and
> declaration of c/r specific methods in, e.g. file_operations.
>
> The remaining changes in this patch include new freezer state
> ("CHECKPOINTING"), mostly refactoring of existing code, and a bit
> of new helpers.
>
> Disclaimer: don't try to compile (or apply) - this is only intended
> to give a ballpark of how the c/r patches change existing code.

The patch size itself isn't too big, but I still think it's one scary patch, mostly because of the breadth of code that checkpointing needs to modify, and I suspect that is probably the biggest concern regarding checkpoint-restart from an implementation point of view.

FWIW, I'm not quite convinced checkpoint-restart can be something which can be generally useful. In controlled environments where the target application behavior can be relatively well defined and contained (including actions necessary to rollback in case something goes bonkers), it would work and can be quite useful, but I'm afraid the states which need to be saved and restored aren't defined well enough to be generally applicable.
Not only is it a difficult problem, it is actually impossible to define a common set of states to be saved and restored - it depends on each application.

As such, I have a difficult time believing it can be something generally useful. IOW, I think talking about its usage in complex environments like common desktops is mostly handwaving. What about X sessions, network connections, states established in other applications via dbus or whatnot? Which files need to be snapshotted together? What about shared mmaps? These questions are not merely difficult to answer in a generic way; they are impossible.

There is a very distinctive difference between system wide suspend/hibernation and process checkpointing. Most programs are already written with the conditions in mind which can be caused by system level suspend/hibernation. Most programs don't expect to be scheduled and run in any definite amount of time. There usually are provisions for loss or failure of resources which are outside the local system. There are corner cases which are affected and those programs contain code to respond to suspend/hibernation. Please note that this is about userland application behavior but not implementation detail in the kernel. It is a much more fundamental property.

So, although checkpoint-restart can be very useful in certain circumstances, I don't believe there can be a general implementation. It inevitably needs to put somewhat strict restrictions on what the applications being checkpointed are allowed to do. And after my train of thought reaches there, I fail to see what the advantages of an in-kernel implementation would be compared to something like the following.

  http://dmtcp.sourceforge.net/

Sure, an in-kernel implementation would be able to fake it better, but I don't think it's anything major. The coverage would be slightly better but breaking the illusion wouldn't take much. Just push it a bit further and it will break all the same.
In addition, to be useful, it would need a userland framework or a set of workarounds which are aware of and can manipulate userland states anyway. For workloads for which checkpointing would be most beneficial (HPC for example), I think something like the above would do just fine and it would make much more sense to add small features to make userland checkpointing work better than doing the whole thing in the kernel.

I think in-kernel checkpointing is in an awkward place in terms of the tradeoff between its benefits and the added complexities to implement it. If you give up coverage slightly, userland checkpointing is there. If you need reliable coverage, proper virtualization isn't too far away. As such, FWIW, I fail to see enough justification for the added complexity. I'll be happy to be proven wrong tho. :-)

Thank you.

--
tejun

^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-02 21:35 ` [Ksummit-2010-discuss] checkpoint-restart: naked patch Tejun Heo
@ 2010-11-02 21:47   ` Christoph Hellwig
  2010-11-04  1:47     ` Nathan Lynch
  2010-11-04  4:34     ` Oren Laadan
  2010-11-04  3:40   ` Kapil Arya
  2010-11-04  4:03   ` Oren Laadan
  2 siblings, 2 replies; 111+ messages in thread
From: Christoph Hellwig @ 2010-11-02 21:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Oren Laadan, ksummit-2010-discuss, linux-kernel

Thanks Tejun,

your writeup brought up a lot of the same issues that I see with the in-kernel C/R. Various C/R implementations that are entirely in userspace or with limited kernel assistance have been in production in HPC environments for years. I think especially for these workloads C/R is an extremely useful feature, and a standard implementation would do Linux well.

But I think the "transparent" in-kernel one is the wrong approach. It tries to give the illusion that C/R will just work, while a lot of things are simply not supported. In this case whitelisting the allowed state by requiring special APIs for all I/O (or even just standard APIs as long as they are supported by the C/R lib you're linked against) is the more pragmatic, and I think faithful approach. In addition to the amount of state not supported despite looking transparent, the other big problem with the patchset is that it saves the kernel internal state, which changes all the time from one release to another. The handwaving is that a userspace tool will solve it. I'm pretty sure that's not the case; it might solve a few cases but the general version n to version m conversion is impossible to maintain. Just look at the problem qemu has migrating between just a handful of versions of the relatively well (compared to random kernel state) defined vmstate format.

^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-02 21:47 ` Christoph Hellwig @ 2010-11-04 1:47 ` Nathan Lynch 2010-11-04 7:36 ` Tejun Heo 2010-11-04 4:34 ` Oren Laadan 1 sibling, 1 reply; 111+ messages in thread From: Nathan Lynch @ 2010-11-04 1:47 UTC (permalink / raw) To: Christoph Hellwig Cc: Tejun Heo, Oren Laadan, ksummit-2010-discuss, linux-kernel On Tue, 2010-11-02 at 22:47 +0100, Christoph Hellwig wrote: > Thanks Tejun, > > your writeup brought up a lot of the same issues that I see with > the in-kernel C/R. Various C/R implementations that are entirely > in userspace or with limited kernel assistance have been in production > in HPC environments for years. FWIW there are a couple of kernel-based C/R implementations (BLCR, OpenVZ) in use in various contexts (not just HPC). > I think especially for these workloads > C/R is an extremly useful feature, and a standard implementation would > do Linux well. > > But I think the "transparent" in-kernel one is the wrong approach. It > tries to give the illusion that C/R will just work, while a lot of > things are simply not support. I think this is somewhat true of the implementation under consideration here (although generally it should fail checkpoints that it can't restart), but it needn't be true of all possible kernel-based implementations. > In this case whitelisting the allowed > state by requiring special APIs for all I/O (or even just standard > APIs as long as they are supposed by the C/R lib you're linked against) > is the more pragmatic, and I think faithful aproach. I don't think users will go for it. They'll continue to use dodgy out-of-tree kernel modules and/or LD_PRELOAD hacks instead of porting their applications to a new library. I think a C/R library is an "ideal" solution, but it's one that nobody would use - especially in HPC, unless the library somehow provides better performance. 
The namespace/isolation features of Linux (CLONE_NEWPID et al) already provide a pretty workable basis for creating tractably checkpoint- and-restartable jobs, with a minimum of performance overhead and application modification. > In addition to > the amount of state not supported despite looking transparant the > other big problem with the patchset is that it saves the kernel internal > state which changes all the time from one release to another. Most of the objects that the patchset saves and restores are right at the "border" of the user/kernel interface, and they're not apt to change much quickly (e.g. vma start and end, task sigaltstack info). The patchset certainly isn't serializing deep internal state such as wait queues, locks, or reference counts. > The handwaiving is that a userspace tool will solve it. I'm pretty sure > that's not the case; it might solve a few cases but the general > version n to version m conversion is impossible to maintain. With this I agree, though. But if a change in kernel implementation details forces an incompatible change in the checkpoint image format, is that really a big deal? Would it be so bad to say that a checkpoint image may be restarted only on the same kernel version that created it? With -stable or enterprise kernels I suspect the issue is unlikely to come up. ^ permalink raw reply [flat|nested] 111+ messages in thread
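
As an illustration of the policy Nathan floats at the end of the message above (an image may be restarted only on the kernel version that created it), here is a minimal sketch. It assumes a hypothetical image header stored as a dict with a `kernel_release` field; `make_header` and `may_restart` are made-up names for this note and are not part of linux-cr or any other C/R tool.

```python
import platform


def make_header(kernel_release=None):
    """Build a hypothetical image header recording the creating kernel release."""
    # platform.release() reports the running kernel release, e.g. "2.6.36".
    return {"kernel_release": kernel_release or platform.release()}


def may_restart(header, running_release=None):
    """Allow restart only if the image was created on the running kernel release."""
    running = running_release or platform.release()
    return header.get("kernel_release") == running
```

A restart tool following this policy would call `may_restart(header)` before touching the rest of the image and refuse outright on a mismatch, instead of attempting any format conversion.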
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-04  1:47     ` Nathan Lynch
@ 2010-11-04  7:36       ` Tejun Heo
  2010-11-04 16:04         ` Gene Cooperman
  2010-11-04 20:45         ` Nathan Lynch
  0 siblings, 2 replies; 111+ messages in thread
From: Tejun Heo @ 2010-11-04 7:36 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Christoph Hellwig, Oren Laadan, ksummit-2010-discuss, linux-kernel, kapil, gene

Hello,

On 11/04/2010 02:47 AM, Nathan Lynch wrote:
>> In this case whitelisting the allowed
>> state by requiring special APIs for all I/O (or even just standard
>> APIs as long as they are supported by the C/R lib you're linked against)
>> is the more pragmatic, and I think faithful approach.
>
> I don't think users will go for it. They'll continue to use dodgy
> out-of-tree kernel modules and/or LD_PRELOAD hacks instead of porting
> their applications to a new library. I think a C/R library is an
> "ideal" solution, but it's one that nobody would use - especially in
> HPC, unless the library somehow provides better performance.

I hear that there are plans to integrate one of the userland snapshotting implementations with an HPC workload manager. ISTR the combination to be condor + dmtcp but I'm not sure. I think things like that make a lot of sense. Scientists writing programs for HPC clusters already work in given frameworks, and what those applications do and how to recover are pretty well confined/defined. If you integrate snapshotting with such frameworks, it becomes pretty easy for both the admins and users.

I'll talk about other issues in the reply to Oren's email.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-04 7:36 ` Tejun Heo @ 2010-11-04 16:04 ` Gene Cooperman 2010-11-04 20:45 ` Nathan Lynch 1 sibling, 0 replies; 111+ messages in thread From: Gene Cooperman @ 2010-11-04 16:04 UTC (permalink / raw) To: Tejun Heo Cc: Nathan Lynch, Christoph Hellwig, Oren Laadan, ksummit-2010-discuss, linux-kernel, kapil, gene Yes, we are working with Condor to have them validate DMTCP. Time will tell. - Gene On Thu, Nov 04, 2010 at 08:36:16AM +0100, Tejun Heo wrote: > Hello, > > On 11/04/2010 02:47 AM, Nathan Lynch wrote: > >> In this case whitelisting the allowed > >> state by requiring special APIs for all I/O (or even just standard > >> APIs as long as they are supposed by the C/R lib you're linked against) > >> is the more pragmatic, and I think faithful aproach. > > > > I don't think users will go for it. They'll continue to use dodgy > > out-of-tree kernel modules and/or LD_PRELOAD hacks instead of porting > > their applications to a new library. I think a C/R library is an > > "ideal" solution, but it's one that nobody would use - especially in > > HPC, unless the library somehow provides better performance. > > I hear that there are plans to integrate one of the userland > snapshotting implementations with HPC workload manager. ISTR the > combination to be condor + dmtcp but not sure. I think things like > that make a lot of sense. Scientists writing programs for HPC > clusters already work in given frameworks and what those applications > do and how to recover are pretty well confined/defined. If you > integrate snapshotting with such frameworks, it becomes pretty easy > for both the admins and users. > > I'll talk about other issues in the reply to Oren's email. > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-04 7:36 ` Tejun Heo 2010-11-04 16:04 ` Gene Cooperman @ 2010-11-04 20:45 ` Nathan Lynch 2010-11-06 6:48 ` Matt Helsley 1 sibling, 1 reply; 111+ messages in thread From: Nathan Lynch @ 2010-11-04 20:45 UTC (permalink / raw) To: Tejun Heo Cc: Christoph Hellwig, Oren Laadan, ksummit-2010-discuss, linux-kernel, kapil, gene On Thu, 2010-11-04 at 08:36 +0100, Tejun Heo wrote: > Hello, > > On 11/04/2010 02:47 AM, Nathan Lynch wrote: > >> In this case whitelisting the allowed > >> state by requiring special APIs for all I/O (or even just standard > >> APIs as long as they are supposed by the C/R lib you're linked against) > >> is the more pragmatic, and I think faithful aproach. > > > > I don't think users will go for it. They'll continue to use dodgy > > out-of-tree kernel modules and/or LD_PRELOAD hacks instead of porting > > their applications to a new library. I think a C/R library is an > > "ideal" solution, but it's one that nobody would use - especially in > > HPC, unless the library somehow provides better performance. > > I hear that there are plans to integrate one of the userland > snapshotting implementations with HPC workload manager. ISTR the > combination to be condor + dmtcp but not sure. I think things like > that make a lot of sense. If you look at the C/R implementations of those two projects you'll see that they don't implement what I take to be hch's suggestion - a library or platform with special-purpose APIs to which applications are ported in order to gain C/R ability. For all their good points, the projects you mention do interposition for glibc's syscall wrappers and provide a few optional hooks so apps can control certain aspects of C/R. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-04 20:45 ` Nathan Lynch @ 2010-11-06 6:48 ` Matt Helsley 0 siblings, 0 replies; 111+ messages in thread From: Matt Helsley @ 2010-11-06 6:48 UTC (permalink / raw) To: Nathan Lynch Cc: Tejun Heo, Christoph Hellwig, Oren Laadan, ksummit-2010-discuss, linux-kernel, kapil, gene On Thu, Nov 04, 2010 at 03:45:37PM -0500, Nathan Lynch wrote: > On Thu, 2010-11-04 at 08:36 +0100, Tejun Heo wrote: > > Hello, > > > > On 11/04/2010 02:47 AM, Nathan Lynch wrote: > > >> In this case whitelisting the allowed > > >> state by requiring special APIs for all I/O (or even just standard > > >> APIs as long as they are supposed by the C/R lib you're linked against) > > >> is the more pragmatic, and I think faithful aproach. > > > > > > I don't think users will go for it. They'll continue to use dodgy > > > out-of-tree kernel modules and/or LD_PRELOAD hacks instead of porting > > > their applications to a new library. I think a C/R library is an > > > "ideal" solution, but it's one that nobody would use - especially in > > > HPC, unless the library somehow provides better performance. > > > > I hear that there are plans to integrate one of the userland > > snapshotting implementations with HPC workload manager. ISTR the > > combination to be condor + dmtcp but not sure. I think things like > > that make a lot of sense. > > If you look at the C/R implementations of those two projects you'll see > that they don't implement what I take to be hch's suggestion - a library > or platform with special-purpose APIs to which applications are ported > in order to gain C/R ability. For all their good points, the projects And even if they did, I don't think asking application developers to use such a broad API -- one that requires special APIs for all I/O -- is practical for many of the purposes outlined at kernel summit. I think DMTCP is better off for not attempting to mandate such APIs. 
How rare is it for an application or library to change the underlying APIs it uses? How many applications have been ported say from Gnome to KDE (or vice-versa) over the lifetime of the project? Relative to all the other applications? I would hazard a guess that most were rewritten rather than ported and that those that were ported are an utterly insignificant fraction of what's out there. It's much better to offer tools that, as much as possible, don't care which APIs the applications use. Cheers, -Matt Helsley ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-02 21:47 ` Christoph Hellwig 2010-11-04 1:47 ` Nathan Lynch @ 2010-11-04 4:34 ` Oren Laadan 2010-11-04 14:25 ` Christoph Hellwig 1 sibling, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-04 4:34 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Tejun Heo, ksummit-2010-discuss, linux-kernel Hi Christoph, I really wish you would have raised these concerns during the ksummit or thereafter. I'm here (LPC) until Friday, and would be happy to discuss any aspect of the linux-cr while at it (and if needed can post a summary to the list). On 11/02/2010 05:47 PM, Christoph Hellwig wrote: > Thanks Tejun, > > your writeup brought up a lot of the same issues that I see with > the in-kernel C/R. Various C/R implementations that are entirely > in userspace or with limited kernel assistance have been in production > in HPC environments for years. I think especially for these workloads > C/R is an extremly useful feature, and a standard implementation would > do Linux well. > > But I think the "transparent" in-kernel one is the wrong approach. It > tries to give the illusion that C/R will just work, while a lot of > things are simply not support. The fact is that an in-kernel implementation can and does support a significantly larger feature-set. Linux-cr does not and will not support everything. Nearly all driver devices won't be supported in the near future (but interested vendors could builds such functionality into their drivers!). Also, pseudo file systems like sysfs, procfs, debugfs will at most get partial support. But apart for that, it really covers (or will soon) nearly everything. So I do wonder what concretely is "a lot of things" ? > In this case whitelisting the allowed > state by requiring special APIs for all I/O (or even just standard > APIs as long as they are supposed by the C/R lib you're linked against) > is the more pragmatic, and I think faithful aproach. 
In addition to > the amount of state not supported despite looking transparant the "Transparent" means that applications don't know that they are being checkpointed, nor do they need to cooperate. So linux-cr is *completely* transparent to applications that are checkpointable. Perhaps you can elaborate on the "state not supported despite looking transparent" - beyond what I mentioned above ? > other big problem with the patchset is that it saves the kernel internal > state which changes all the time from one release to another. It is our experience that the format is pretty immune to changes that occur to in-kernel (and not user/ABI visible) structures. It mainly changes when we add new features - and I expect that to happen less frequently once the patchset finds its way to the mainline. > The > handwaiving is that a userspace tool will solve it. I'm pretty sure > that's not the case; it might solve a few cases but the general > version n to version m conversion is impossible to maintain. Just look > at the problem qemu has migration between just a handfull of version > of the relatively well (compared to random kernel state) defined vmstate > format. The problem space is smaller, because we are aiming at a simpler goal. We need to always know how to convert from version N to version N+1. Then conversion from N to N+k is a series of these conversions. QEMU has a broader goal: IIUC, both QEMU and KVM versions may change, they are not tied to each other. So the problem is harder. In linux-cr, the format is tied to the version of objects that the kernel that outputs/inputs the data knows. That makes things much simpler. Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
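
The chained conversion Oren describes in the message above (always know how to go from version N to N+1, then compose those steps to get from N to N+k) can be sketched roughly as follows. This is an illustration under stated assumptions, not linux-cr's image format or conversion tool: the image is modeled as a plain dict with a `version` field, and the field names (`sigstack`, `sigaltstack`, `pid_ns_depth`) and both converters are invented for the example.

```python
from typing import Callable, Dict

# Hypothetical registry: source version N -> function producing a version N+1 image.
CONVERTERS: Dict[int, Callable[[dict], dict]] = {}


def converter(from_version: int):
    """Register a single-step converter from from_version to from_version + 1."""
    def register(func):
        CONVERTERS[from_version] = func
        return func
    return register


@converter(1)
def v1_to_v2(image):
    # Invented change: version 2 renames "sigstack" to "sigaltstack".
    out = dict(image)
    out["sigaltstack"] = out.pop("sigstack")
    out["version"] = 2
    return out


@converter(2)
def v2_to_v3(image):
    # Invented change: version 3 adds a field with a default value.
    out = dict(image)
    out.setdefault("pid_ns_depth", 0)
    out["version"] = 3
    return out


def upgrade(image, target_version):
    """Apply single-step converters until the image reaches target_version."""
    current = dict(image)  # work on a copy; the caller's image is left untouched
    while current["version"] < target_version:
        step = CONVERTERS.get(current["version"])
        if step is None:
            raise ValueError(f"no converter from version {current['version']}")
        current = step(current)
    if current["version"] != target_version:
        raise ValueError("downgrade is not supported")
    return current
```

Calling `upgrade({"version": 1, "sigstack": 4096}, 3)` runs both steps in order. The sketch also shows where Christoph's maintenance concern bites: every release that changes the format must ship one more step, and a single missing step breaks every chain that passes through it.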
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-04  4:34     ` Oren Laadan
@ 2010-11-04 14:25       ` Christoph Hellwig
  0 siblings, 0 replies; 111+ messages in thread
From: Christoph Hellwig @ 2010-11-04 14:25 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Christoph Hellwig, Tejun Heo, ksummit-2010-discuss, linux-kernel

On Thu, Nov 04, 2010 at 12:34:51AM -0400, Oren Laadan wrote:
> Hi Christoph,
>
> I really wish you would have raised these concerns during the
> ksummit or thereafter. I'm here (LPC) until Friday, and would be
> happy to discuss any aspect of the linux-cr while at it (and if
> needed can post a summary to the list).

Discussing technical topics with slides in a big room is utterly pointless. Just like during all the other such boring talks during KS, I was either asleep, working on something important, or out of the room doing the extended hallway track.

If you want to discuss invasive kernel changes with people, do it by email. The chance that anyone is going to listen to you is a lot higher.

^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-02 21:35 ` [Ksummit-2010-discuss] checkpoint-restart: naked patch Tejun Heo 2010-11-02 21:47 ` Christoph Hellwig @ 2010-11-04 3:40 ` Kapil Arya 2010-11-04 8:05 ` Tejun Heo 2010-11-04 4:03 ` Oren Laadan 2 siblings, 1 reply; 111+ messages in thread From: Kapil Arya @ 2010-11-04 3:40 UTC (permalink / raw) To: Tejun Heo Cc: Oren Laadan, ksummit-2010-discuss, linux-kernel, Gene Cooperman, Kapil Arya (Sorry for resending the message; the last message contained some html tags and was rejected by server) We would like to thank the previous post for bringing up the topic of kernel C/R versus userland C/R. We are two of the developers of DMTCP (userland checkpointing): Distributed MultiThreaded CheckPointing . http://dmtcp.sourceforge.net We had waited to write to the kernel developers because we had wanted to ensure that DMTCP is sufficiently robust before wasting the time of the kernel developers. This thread seems like a good opportunity to begin a dialogue. In fact, we only became aware of Linux kernel C/R this September. Of course, we were aware of Oren Laadan's fine earlier work on ZapC for distributed checkpointing using the Linux kernel (CLUSTER-2005). We have a high respect for Oren Laadan and the other Linux C/R developers, as well as for the developers of BLCR (a C/R kernel module with a userland component that is widely used in HPC batch faciliites). By coincidence, when we became aware of Linux C/R, we were already in the middle of development for a major new release of DMTCP (from version 1.1.x to 1.2.0). We just finished that release. Among other features, this release supports checkpointing of GNU 'screen', and we have tested screen in some common use cases (with vim, with emacs, etc.). While it supports ssh (e.g. checkpointing OpenMPI, which uses ssh), it doesn't yet support _interactive_ ssh sessions. That will come in the next release. 
We believe that both Linux C/R and DMTCP are becoming quite mature, and that in general, one can achieve good application coverage with either. In our personal view, a key difference between in-kernel and userland approaches is the issue of security. The Linux C/R developers state the issue very well in their FAQ (question number 7): > https://ckpt.wiki.kernel.org/index.php/Faq : > 7. Can non-root users checkpoint/restart an application ? > > For now, only users with CAP_SYSADMIN privileges can C/R an > application. This is to ensure that the checkpoint image has not been > tampered with and will be treated like a loadable kernel-module. The previous posts also brought up the issue of external connections. While DMTCP has been developed over six years, in the last year we have concentrated especially on the issue of external connections. While we've accumulated many war stories, one will illustrate the point. Most Linux distros link vi to vim. Vim supports mouse and other operations via the X11 server. When vim starts up, it connects to the X11 server (which may be local, or remote if ssh uses X11 forwarding). On transparent checkpoint and restart, vim expects to continue talking to the X11 server. Currently, DMTCP recognizes such X11 server connections and refuses them. Vim still survives without its mouse and other X11 services. For the future, we are considering a more flexible approach that will take account of the X11 protocol. Strategies like these are easily handled in userspace. We suspect that while one may begin with a pure kernel approach, eventually, one will still want to add a userland component to achieve this kind of flexibility, just as BLCR has already done. 
Best wishes, - Gene Cooperman and Kapil Arya from the DMTCP team On Tue, Nov 2, 2010 at 5:35 PM, Tejun Heo <tj@kernel.org> wrote: > (cc'ing lkml too) > Hello, > > On 11/02/2010 08:30 PM, Oren Laadan wrote: >> Following the discussion yesterday, here is a linux-cr diff that >> that is limited to changes to existing code. >> >> The diff doesn't include the eclone() patches. I also tried to strip >> off the new c/r code (either code in new files, or new code within >> #ifdef CONFIG_CHECKPOINT in existing files). >> >> I left a few such snippets in, e.g. c/r syscalls templates and >> declaration of c/r specific methods in, e.g. file_operations. >> >> The remaining changes in this patch include new freezer state >> ("CHECKPOINTING"), mostly refactoring of exsiting code, and a bit >> of new helpers. >> >> Disclaimer: don't try to compile (or apply) - this is only intended >> to give a ballpark of how the c/r patches change existing code. > > The patch size itself isn't too big but I still think it's one scary > patch mostly because the breadth of the code checkpointing needs to > modify and I suspect that probably is the biggest concern regarding > checkpoint-restart from implementation point of view. > > FWIW, I'm not quite convinced checkpoint-restart can be something > which can be generally useful. In controlled environments where the > target application behavior can be relatively well defined and > contained (including actions necessary to rollback in case something > goes bonkers), it would work and can be quite useful, but I'm afraid > the states which need to be saved and restored aren't defined well > enough to be generally applicable. Not only is it a difficult > problem, it actually is impossible to define common set of states to > be saved and restored - it depends on each application. > > As such, I have difficult time believing it can be something generally > useful. 
IOW, I think talking about its usage in complex environments > like common desktops is mostly handwaving. What about X sessions, > network connections, states established in other applications via dbus > or whatnot? Which files need to be snapshotted together? What about > shared mmaps? These questions are not difficult to answer in generic > way, they are impossible. > > There is a very distinctive difference between system wide > suspend/hibernation and process checkpointing. Most programs are > already written with the conditions in mind which can be caused by > system level suspend/hibernation. Most programs don't expect to be > scheduled and run in any definite amount of time. There usually > are provisions for loss or failure of resources which are out of the > local system. There are corner cases which are affected and those > programs contain code to respond to suspend/hibernation. Please note > that this is about userland application behavior but not > implementation detail in the kernel. It is a much more fundamental > property. > > So, although checkpoint-restart can be very useful for certain > circumstances, I don't believe there can be a general implementation. > It inevitably needs to put somewhat strict restrictions on what the > applications being checkpointed are allowed to do. And after my > train of thought reaches there, I fail to see what the advantages of > in-kernel implementation would be compared to something like the > following. > > http://dmtcp.sourceforge.net/ > > Sure, in-kernel implementation would be able to fake it better, but I > don't think it's anything major. The coverage would be slightly > better but breaking the illusion wouldn't take much. Just push it a > bit further and it will break all the same. In addition, to be > useful, it would need userland framework or set of workarounds which > are aware of and can manipulate userland states anyway. 
For workloads > for which checkpointing would be most beneficial (HPC for example), I > think something like the above would do just fine and it would make > much more sense to add small features to make userland checkpointing > work better than doing the whole thing in the kernel. > > I think in-kernel checkpointing is in awkward place in terms of > tradeoff between its benefits and the added complexities to implement > it. If you give up coverage slightly, userland checkpointing is > there. If you need reliable coverage, proper virtualization isn't too > far away. As such, FWIW, I fail to see enough justification for the > added complexity. I'll be happy to be proven wrong tho. :-) > > Thank you. > > -- > tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-04  3:40   ` Kapil Arya
@ 2010-11-04  8:05     ` Tejun Heo
  2010-11-04 16:44       ` Gene Cooperman
  2010-11-05 22:24       ` Oren Laadan
  0 siblings, 2 replies; 111+ messages in thread
From: Tejun Heo @ 2010-11-04 8:05 UTC (permalink / raw)
  To: Kapil Arya
  Cc: Oren Laadan, ksummit-2010-discuss, linux-kernel, Gene Cooperman, hch

Hello,

On 11/04/2010 04:40 AM, Kapil Arya wrote:
> (Sorry for resending the message; the last message contained some html
> tags and was rejected by server)

And please also don't top-post. Being the antisocial egomaniacs we are, people on lkml prefer to dissect the messages we're replying to, insert insulting comments right where they would be most effective and remove the passages which can't yield effective insults. :-)

> In our personal view, a key difference between in-kernel and userland
> approaches is the issue of security. The Linux C/R developers state
> the issue very well in their FAQ (question number 7):
>> https://ckpt.wiki.kernel.org/index.php/Faq :
>> 7. Can non-root users checkpoint/restart an application ?
>>
>> For now, only users with CAP_SYSADMIN privileges can C/R an
>> application. This is to ensure that the checkpoint image has not been
>> tampered with and will be treated like a loadable kernel-module.

That's an interesting point but I don't think it's a dealbreaker. Kernel CR is gonna require a userland agent anyway and access control can be done there. Being able to snapshot w/o root privilege definitely is a plus, but it's not like CR is gonna be deployed on the majority of desktops and servers (if so, let's talk about it then).

> Strategies like these are easily handled in userspace. We suspect
> that while one may begin with a pure kernel approach, eventually,
> one will still want to add a userland component to achieve this kind
> of flexibility, just as BLCR has already done.

Yeap, agreed. There's gotta be user agents which can monitor and manipulate userland states.
It's a fundamentally nasty job, that of
collecting and applying application-specific workarounds. I've only
glanced at the dmtcp paper so my understanding is pretty superficial.
With that in mind, can you please answer some of my curiosities?

* As Oren pointed out in another message, there are some things which
  could seem a bit too visible to the target application. Like the
  manager thread (is it visible to the application or is it hidden by
  the libc wrapper?) and reserved signal. Also, while it's true that
  all programs should be ready to handle -EINTR failure from system
  calls, it's something which is very difficult to verify and test and
  could lead to once-in-a-blue-moon head scratchy kind of failures.

  I think most of those issues can be tackled with minor narrow-scoped
  changes to the kernel. Do you guys have things in mind which the
  kernel can do to make these things more transparent or safer?

* The feats dmtcp achieves with its set of workarounds are impressive
  but at the same time look quite hairy. Christoph said that having a
  standard userland C-R implementation would be quite useful and IMHO
  it would be helpful in that direction if the implementation is
  modularized enough so that the core functionality and the set of
  workarounds can be easily separated. Is it already so?

Thanks.

--
tejun
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-04 8:05 ` Tejun Heo @ 2010-11-04 16:44 ` Gene Cooperman 2010-11-05 9:28 ` Tejun Heo 2010-11-05 22:24 ` Oren Laadan 1 sibling, 1 reply; 111+ messages in thread From: Gene Cooperman @ 2010-11-04 16:44 UTC (permalink / raw) To: Tejun Heo Cc: Kapil Arya, Oren Laadan, ksummit-2010-discuss, linux-kernel, Gene Cooperman, hch Thanks for your comments. We apologize for the top-post. It was accidental. > > In our personal view, a key difference between in-kernel and userland > > approaches is the issue of security. > That's an interesting point but I don't think it's a dealbreaker. > ... but it's not like CR is gonna be deployed on > majority of desktops and servers (if so, let's talk about it then). This is a good point to clarify some issues. C/R has several good targets. For example, BLCR has targeted HPC batch facilities, and does it well. DMTCP started life on the desktop, and it's still a primary focus of DMTCP. We worked to support screen on this release precisely so that advanced desktop users have the option of putting their whole screen session under checkpoint control. It complements the core goal of screen: If you walk away from a terminal, you can get back the session elsewhere. If your session crashes, you can get back the session elsewhere (depending on where you save the checkpoint files, of course :-) ). > * As Oren pointed out in another message, there are somethings which > could seem a bit too visible to the target application. Like the > manager thread (is it visible to the application or is it hidden by > the libc wrapper?) and reserved signal. Also, while it's true that > all programs should be ready to handle -EINTR failure from system > calls, it's something which is very difficult to verify and test and > could lead to once-in-a-blue-moon head scratchy kind of failures. These are also some excellent points for discussion! The manager thread is visible. 
For example, if you run a gdb session under checkpoint
control (only available in our unstable branch, currently), then
the gdb session will indeed see the checkpoint manager thread.
So, yes. We are not totally transparent, and a skilled user must
account for this. There are analogies (the manager thread in the
original LinuxThreads, the rare misfortune of gdb to lose track of the
stack frames).

We try to hide the reserved signal (SIGUSR2 by default, but the user
can configure it to anything else). We put wrappers around system
calls that might see our signal handler, but I'm sure there are
cases where we might not succeed --- and so a skilled user would
have to configure to use a different signal handler. And of course,
there is the rare application that repeatedly resets _every_ signal.
We encountered this in an earlier version of Maple, and the Maple
developers worked with us to open up a hole so that we could
checkpoint Maple in future versions.

> [while] all programs should be ready to handle -EINTR failure from system
> calls, it's something which is very difficult to verify and test and
> could lead to once-in-a-blue-moon head scratchy kind of failures.

Exactly right! Excellent point. Perhaps this gets down to
philosophy, and what is the nature of a bug. :-) In some cases, we
have encountered this issue. Our solution was either to refuse to
checkpoint within certain system calls, or to check the return value
and if there was an -EINTR, then we would re-execute the system
call. This works again, because we are using wrappers around many
(but not all) of the system calls.

> Do you guys have things on mind which the
> kernel can do to make these things more transparent or safer?

For the most part, we've always found a way to work within the current
design of the kernel. We consider this a tribute to the Linux kernel
design. They provided hooks in cases that userland C/R needs, even
though the hooks were there simply on general design principles.
But since you ask :-), there is one thing on our wish list. We handle address space randomization, vdso, vsyscall, and so on quite well. We do not turn off address space randomization (although on restart, we map user segments back to their original addresses). Probably the randomized value of brk (end-of-data or end of heap) is the thing that gave us the most troubles and that's where the code is the most hairy. > * The feats dmtcp achieves with its set of workarounds are impressive > but at the same time look quite hairy. Christoph said that having a > standard userland C-R implementation would be quite useful and IMHO > it would be helpful in that direction if the implementation is > modularized enough so that the core functionality and the set of > workarounds can be easily separated. Is it already so? The implementation is reasonably modularized. In the rush to address bugs or feature requirements of users, we sometimes cut corners. We intend to go back and fix those things. Roughly, the architecture of DMTCP is to do things in two layers: MTCP handles a single multi-threaded process. There is a separate library mtcp.so. The higher layer (redundantly again called DMTCP) is implemented in dmtcphijack.so. In a _very_ rough kind of way, MTCP does a lot of what would be done within kernel C/R. But the higher DMTCP layer takes on some of those responsibilities in places. For example, DMTCP does part of analyzing the pseudo-ttys, since it's not always easy to ensure that it's the controlling terminal of some process that can checkpoint things in the MTCP layer. Beyond that, the wrappers around system calls are essentially perfectly modular. Some system calls go together to support a single kernel feature, and those wrappers are kept in a common file. There are some very few program-specific workarounds. If you look at the main routine of dmtcp_checkpoint.cpp, you'll find most of them. 
For example, if it's a setuid process, since we don't have root privilege, we can't preload our dmtcphijack.so. So, we copy the setuid process to our own /tmp, and execute it there without setuid. In the case of screen, it wants to use /var/... (forgot the directory). But screen has an option to use a different directory. Similarly, if the distro is running an NSCD daemon, then gethostname and similar calls go the NSCD daemon. On restart, we have to re-initialize communication with the NSCD daemon. I have to run to do some other things. But I'll check back on the remaining (and any new) posts on this list later today. Thanks very much for the interesting discussion. We've felt too isolated for too long. But we didn't think we had something important enough before to disturb the kernel developers with a discussion. I hope DMTCP is starting to become mature enough that this discussion can now benefit everybody. We certainly hope to learn a lot from it. Thanks again. - Gene ==== On Thu, Nov 04, 2010 at 09:05:28AM +0100, Tejun Heo wrote: > Hello, > > On 11/04/2010 04:40 AM, Kapil Arya wrote: > > (Sorry for resending the message; the last message contained some html > > tags and was rejected by server) > > And please also don't top-post. Being the antisocial egomaniacs we > are, people on lkml prefer to dissect the messages we're replying to, > insert insulting comments right where they would be most effective and > remove the passages which can't yield effective insults. :-) > > > In our personal view, a key difference between in-kernel and userland > > approaches is the issue of security. The Linux C/R developers state > > the issue very well in their FAQ (question number 7): > >> https://ckpt.wiki.kernel.org/index.php/Faq : > >> 7. Can non-root users checkpoint/restart an application ? > >> > >> For now, only users with CAP_SYSADMIN privileges can C/R an > >> application. 
This is to ensure that the checkpoint image has not been > >> tampered with and will be treated like a loadable kernel-module. > > That's an interesting point but I don't think it's a dealbreaker. > Kernel CR is gonna require userland agent anyway and access control > can be done there. Being able to snapshot w/o root privieldge > definitely is a plust but it's not like CR is gonna be deployed on > majority of desktops and servers (if so, let's talk about it then). > > > Strategies like these are easily handled in userspace. We suspect > > that while one may begin with a pure kernel approach, eventually, > > one will still want to add a userland component to achieve this kind > > of flexibility, just as BLCR has already done. > > Yeap, agreed. There gotta be user agents which can monitor and > manipulate userland states. It's a fundamentally nasty job, that of > collecting and applying application-specific workarounds. I've only > glanced the dmtcp paper so my understanding is pretty superficial. > With that in mind, can you please answer some of my curiosities? > > * As Oren pointed out in another message, there are somethings which > could seem a bit too visible to the target application. Like the > manager thread (is it visible to the application or is it hidden by > the libc wrapper?) and reserved signal. Also, while it's true that > all programs should be ready to handle -EINTR failure from system > calls, it's something which is very difficult to verify and test and > could lead to once-in-a-blue-moon head scratchy kind of failures. > > I think most of those issues can be tackled with minor narrow-scoped > changes to the kernel. Do you guys have things on mind which the > kernel can do to make these things more transparent or safer? > > * The feats dmtcp achieves with its set of workarounds are impressive > but at the same time look quite hairy. 
Christoph said that having a
> standard userland C-R implementation would be quite useful and IMHO
> it would be helpful in that direction if the implementation is
> modularized enough so that the core functionality and the set of
> workarounds can be easily separated. Is it already so?
>
> Thanks.
>
> --
> tejun
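
As a rough illustration of the wrapper strategy Gene describes above (check the result of a system call and re-execute it if it failed with EINTR), here is a minimal sketch in Python. It is not DMTCP's code: `retry_on_eintr` and the `FlakyRead` stand-in are hypothetical names invented for this example, and the "system call" is simulated so the behaviour can be seen without sending real signals. Note that Python 3.5 and later already retry most interrupted system calls internally (PEP 475), so the explicit loop only shows the idea.

```python
import errno
import os


def retry_on_eintr(call, *args, max_attempts=50):
    """Re-issue call(*args) for as long as it fails with EINTR.

    Hypothetical helper: a userland C/R wrapper would sit between the
    application and the real system call in roughly this way.
    """
    for _ in range(max_attempts):
        try:
            return call(*args)
        except InterruptedError:
            # Interrupted by a signal (e.g. the reserved checkpoint
            # signal): hide the failure and simply try again.
            continue
    raise OSError(errno.EINTR, "still interrupted after retries")


class FlakyRead:
    """Hypothetical stand-in for a read() that a signal interrupts."""

    def __init__(self, interruptions, payload):
        self.interruptions = interruptions
        self.payload = payload
        self.calls = 0

    def __call__(self, nbytes):
        self.calls += 1
        if self.calls <= self.interruptions:
            raise InterruptedError(errno.EINTR, os.strerror(errno.EINTR))
        return self.payload[:nbytes]


if __name__ == "__main__":
    flaky = FlakyRead(interruptions=2, payload=b"state")
    print(retry_on_eintr(flaky, 3), "after", flaky.calls, "attempts")
```

The application calling through the wrapper never sees the interrupted attempts, which is the transparency the wrappers aim for. A call that is not wrapped would still observe EINTR, which is the gap Tejun points at.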
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-04 16:44 ` Gene Cooperman @ 2010-11-05 9:28 ` Tejun Heo 2010-11-05 23:18 ` Oren Laadan ` (2 more replies) 0 siblings, 3 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-05 9:28 UTC (permalink / raw) To: Gene Cooperman Cc: Kapil Arya, Oren Laadan, ksummit-2010-discuss, linux-kernel, hch Hello, On 11/04/2010 05:44 PM, Gene Cooperman wrote: >>> In our personal view, a key difference between in-kernel and userland >>> approaches is the issue of security. >> >> That's an interesting point but I don't think it's a dealbreaker. >> ... but it's not like CR is gonna be deployed on >> majority of desktops and servers (if so, let's talk about it then). > > This is a good point to clarify some issues. C/R has several good > targets. For example, BLCR has targeted HPC batch facilities, and > does it well. > > DMTCP started life on the desktop, and it's still a primary focus of > DMTCP. We worked to support screen on this release precisely so > that advanced desktop users have the option of putting their whole > screen session under checkpoint control. It complements the core > goal of screen: If you walk away from a terminal, you can get back > the session elsewhere. If your session crashes, you can get back > the session elsewhere (depending on where you save the checkpoint > files, of course :-) ). Call me skeptical but I still don't see, yet, it being a mainstream thing (for average sysadmin John and proverbial aunt Tilly). It definitely is useful for many different use cases tho. Hey, but let's see. > These are also some excellent points for discussion! The manager thread > is visible. For example, if you run a gdb session under checkpoint > control (only available in our unstable branch, currently), then > the gdb session will indeed see the checkpoint manager thread. I don't think gdb seeing it is a big deal as long as it's hidden from the application itself. 
> We try to hid the reserved signal (SIGUSR2 by default, but the user > can configure it to anything else). We put wrappers around system > calls that might see our signal handler, but I'm sure there are > cases where we might not succeed --- and so a skilled user would > have to configure to use a different signal handler. And of course, > there is the rare application that repeatedly resets _every_ signal. > We encountered this in an earlier version of Maple, and the Maple > developers worked with us to open up a hole so that we could > checkpoint Maple in future versions. > >> [while] all programs should be ready to handle -EINTR failure from system >> calls, it's something which is very difficult to verify and test and >> could lead to once-in-a-blue-moon head scratchy kind of failures. > > Exactly right! Excellent point. Perhaps this gets down to > philosophy, and what is the nature of a bug. :-) In some cases, we > have encountered this issue. Our solution was either to refuse to > checkpoint within certain system calls, or to check the return value > and if there was an -EINTR, then we would re-execute the system > call. This works again, because we are using wrappers around many > (but not all) of the system calls. I'm probably missing something but can't you stop the application using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry about -EINTR failures (there are some exceptions but nothing really to worry about). Also, unless the manager thread needs to be always online, you can inject manager thread by manipulating the target process states while taking a snapshot. > But since you ask :-), there is one thing on our wish list. We > handle address space randomization, vdso, vsyscall, and so on quite > well. We do not turn off address space randomization (although on > restart, we map user segments back to their original addresses). 
> Probably the randomized value of brk (end-of-data or end of heap) is > the thing that gave us the most troubles and that's where the code > is the most hairy. Can you please elaborate a bit? What do you want to see changed? > The implementation is reasonably modularized. In the rush to > address bugs or feature requirements of users, we sometimes cut > corners. We intend to go back and fix those things. Roughly, the > architecture of DMTCP is to do things in two layers: MTCP handles a > single multi-threaded process. There is a separate library mtcp.so. > The higher layer (redundantly again called DMTCP) is implemented in > dmtcphijack.so. In a _very_ rough kind of way, MTCP does a lot of > what would be done within kernel C/R. But the higher DMTCP layer > takes on some of those responsibilities in places. For example, > DMTCP does part of analyzing the pseudo-ttys, since it's not always > easy to ensure that it's the controlling terminal of some process > that can checkpoint things in the MTCP layer. > > Beyond that, the wrappers around system calls are essentially > perfectly modular. Some system calls go together to support a > single kernel feature, and those wrappers are kept in a common file. I see. I just thought that it would be helpful to have the core part - which does per-process checkpointing and restoring and corresponds to the features implemented by in-kernel CR - as a separate thing. It already sounds like that is mostly the case. I don't have much idea about the scope of the whole thing, so please feel free to hammer senses into me if I go off track. From what I read, it seems like once the target process is stopped, dmtcp is able to get most information necessary from kernel via /proc and other methods but the paper says that it needs to intercept socket related calls to gather enough information to recreate them later. I'm curious what's missing from the current /proc. 
You can map socket to
inode from /proc/*/fd which can be matched to an entry in
/proc/*/net/PROTO to find out the addresses and most socket options
should be readable via getsockopt. Am I missing something?

I think this is why userland CR implementation makes much more sense.
Most of the states visible to a userland process are rather rigidly
defined by standards and, ultimately, ABI and the kernel exports most
of that information to userland one way or the other. Given the
right set of needed features, most of which are probably already
implemented, a userland implementation should have access to most
information necessary to checkpoint without resorting to too messy
methods. Then there inevitably need to be some workarounds to make
CR'd processes behave properly w.r.t. other states on the system, so
userland workarounds are inevitable anyway unless it resorts to
preemptive separation using namespaces and containers, which I frankly
think isn't of much value already and more so going forward.

Thanks.

--
tejun
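
The /proc lookup Tejun outlines above can be sketched as follows. This is only an illustration under stated assumptions: it handles IPv4 TCP only, it inspects the calling process itself, and the helper names (`socket_inode`, `decode_ipv4`, `lookup_tcp4`, `demo`) are made up for this example. The getsockopt() call works here because the socket belongs to the caller; for another process it would have to be issued from within that process, which is one of the points under debate in this thread.

```python
import os
import re
import socket
import struct


def socket_inode(pid, fd):
    """Return the socket inode behind /proc/<pid>/fd/<fd>, or None."""
    target = os.readlink(f"/proc/{pid}/fd/{fd}")
    match = re.fullmatch(r"socket:\[(\d+)\]", target)
    return int(match.group(1)) if match else None


def decode_ipv4(field):
    """Decode an 'ADDR:PORT' hex field from /proc/net/tcp."""
    addr_hex, port_hex = field.split(":")
    # The kernel prints the in-memory (network order) address as a
    # native-endian integer, so pack it back with native byte order.
    ip = socket.inet_ntoa(struct.pack("=I", int(addr_hex, 16)))
    return ip, int(port_hex, 16)


def lookup_tcp4(pid, inode):
    """Find the row of /proc/<pid>/net/tcp that owns the given inode."""
    with open(f"/proc/{pid}/net/tcp") as table:
        next(table)  # skip the column header
        for row in table:
            fields = row.split()
            if int(fields[9]) == inode:  # column 9 is the socket inode
                return {
                    "local": decode_ipv4(fields[1]),
                    "remote": decode_ipv4(fields[2]),
                    "state": int(fields[3], 16),
                }
    return None


def demo():
    """Create a listening socket and rediscover it through /proc."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        inode = socket_inode(os.getpid(), listener.fileno())
        entry = lookup_tcp4(os.getpid(), inode)
        sock_type = listener.getsockopt(socket.SOL_SOCKET, socket.SO_TYPE)
        return listener.getsockname(), entry, sock_type


if __name__ == "__main__":
    print(demo())
```

What the sketch does not recover is anything that never shows up in these tables, such as data still queued in the socket buffers, so it covers only the "addresses and options" part of the question.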
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-05 9:28 ` Tejun Heo @ 2010-11-05 23:18 ` Oren Laadan 2010-11-06 10:13 ` Tejun Heo 2010-11-06 0:36 ` Kapil Arya 2010-11-06 5:32 ` Matt Helsley 2 siblings, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-05 23:18 UTC (permalink / raw) To: Tejun Heo Cc: Gene Cooperman, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch On 11/05/2010 05:28 AM, Tejun Heo wrote: > Hello, > > On 11/04/2010 05:44 PM, Gene Cooperman wrote: >>>> In our personal view, a key difference between in-kernel and userland >>>> approaches is the issue of security. >>> >>> That's an interesting point but I don't think it's a dealbreaker. >>> ... but it's not like CR is gonna be deployed on >>> majority of desktops and servers (if so, let's talk about it then). >> >> This is a good point to clarify some issues. C/R has several good >> targets. For example, BLCR has targeted HPC batch facilities, and >> does it well. >> >> DMTCP started life on the desktop, and it's still a primary focus of >> DMTCP. We worked to support screen on this release precisely so >> that advanced desktop users have the option of putting their whole >> screen session under checkpoint control. It complements the core >> goal of screen: If you walk away from a terminal, you can get back >> the session elsewhere. If your session crashes, you can get back >> the session elsewhere (depending on where you save the checkpoint >> files, of course :-) ). > > Call me skeptical but I still don't see, yet, it being a mainstream > thing (for average sysadmin John and proverbial aunt Tilly). It > definitely is useful for many different use cases tho. Hey, but let's > see. > >> These are also some excellent points for discussion! The manager thread >> is visible. For example, if you run a gdb session under checkpoint >> control (only available in our unstable branch, currently), then >> the gdb session will indeed see the checkpoint manager thread. 
> > I don't think gdb seeing it is a big deal as long as it's hidden from > the application itself. > >> We try to hid the reserved signal (SIGUSR2 by default, but the user >> can configure it to anything else). We put wrappers around system >> calls that might see our signal handler, but I'm sure there are >> cases where we might not succeed --- and so a skilled user would >> have to configure to use a different signal handler. And of course, >> there is the rare application that repeatedly resets _every_ signal. >> We encountered this in an earlier version of Maple, and the Maple >> developers worked with us to open up a hole so that we could >> checkpoint Maple in future versions. >> >>> [while] all programs should be ready to handle -EINTR failure from system >>> calls, it's something which is very difficult to verify and test and >>> could lead to once-in-a-blue-moon head scratchy kind of failures. >> >> Exactly right! Excellent point. Perhaps this gets down to >> philosophy, and what is the nature of a bug. :-) In some cases, we >> have encountered this issue. Our solution was either to refuse to >> checkpoint within certain system calls, or to check the return value >> and if there was an -EINTR, then we would re-execute the system >> call. This works again, because we are using wrappers around many >> (but not all) of the system calls. > > I'm probably missing something but can't you stop the application > using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry > about -EINTR failures (there are some exceptions but nothing really to > worry about). Also, unless the manager thread needs to be always > online, you can inject manager thread by manipulating the target > process states while taking a snapshot. This is an excellent example to demonstrate several points: * To freeze the processes, you can use (quote) "hairy" signal overload mechanism, or even more hairy ptrace; both by the way have their performance problem with many processes/threads. 
Or you can
use the in-kernel freezer-cgroup, and forget about workarounds, like
linux-cr does. And ~200 lines in said diff are dedicated exactly to
that.

* Then, because both the workaround and the entire philosophy
of the MTCP c/r engine is that affected processes _participate_ in
the checkpoint, their syscalls _must_ be interrupted. In contrast,
the linux-cr kernel approach not only allows checkpointing processes
without their collaboration, but also builds on the native signal
handling kernel code to restart the system calls (both after
unfreeze, and after restart), such that the original process
does not observe -EINTR.

>> But since you ask :-), there is one thing on our wish list. We
>> handle address space randomization, vdso, vsyscall, and so on quite
>> well. We do not turn off address space randomization (although on
>> restart, we map user segments back to their original addresses).
>> Probably the randomized value of brk (end-of-data or end of heap) is
>> the thing that gave us the most troubles and that's where the code
>> is the most hairy.
>
> Can you please elaborate a bit? What do you want to see changed?

Aha ... another great example: yet another piece of the suspect
diff in question is dedicated to allowing a restarting process to
request a specific location for the vdso.

BTW, a real security expert (and I'm not one...) may argue that
this operation should only be allowed to privileged users. In fact,
if your code gets around the linux ASLR mechanisms, then someone
should fix the kernel ASLR code :)

>> The implementation is reasonably modularized. In the rush to
>> address bugs or feature requirements of users, we sometimes cut
>> corners. We intend to go back and fix those things. Roughly, the
>> architecture of DMTCP is to do things in two layers: MTCP handles a
>> single multi-threaded process. There is a separate library mtcp.so.
>> The higher layer (redundantly again called DMTCP) is implemented in
>> dmtcphijack.so.
In a _very_ rough kind of way, MTCP does a lot of >> what would be done within kernel C/R. But the higher DMTCP layer >> takes on some of those responsibilities in places. For example, >> DMTCP does part of analyzing the pseudo-ttys, since it's not always >> easy to ensure that it's the controlling terminal of some process >> that can checkpoint things in the MTCP layer. >> >> Beyond that, the wrappers around system calls are essentially >> perfectly modular. Some system calls go together to support a >> single kernel feature, and those wrappers are kept in a common file. > > I see. I just thought that it would be helpful to have the core part > - which does per-process checkpointing and restoring and corresponds > to the features implemented by in-kernel CR - as a separate thing. It > already sounds like that is mostly the case. FWIW, the restart portion of linux-cr is designed with this in mind - it is flexible enough to accommodate for smart userspace tools and wrappers that wish to mock with the processes and their resource post-restart (but before the processes resume execution). For example, a distributed checkpoint tool could, at restart time, reestablish the necessary network connections (which is much different than live migration of connections, and clearly not a kernel task). This way, it is trivial to migrate a distributed application from one set of hosts to another, on different networks, with very little effort. > > I don't have much idea about the scope of the whole thing, so please > feel free to hammer senses into me if I go off track. From what I > read, it seems like once the target process is stopped, dmtcp is able > to get most information necessary from kernel via /proc and other > methods but the paper says that it needs to intercept socket related > calls to gather enough information to recreate them later. I'm > curious what's missing from the current /proc. 
You can map socket to
> inode from /proc/*/fd which can be matched to an entry in
> /proc/*/net/PROTO to find out the addresses and most socket options
> should be readable via getsockopt. Am I missing something?

So you'll need mechanisms not only to read the data at checkpoint
time but also to reinstate the data at restart time. By the time
you are done, the kernel will have all the c/r code (the suspect diff
in question _and_ the rest of the logic) in the form of new interfaces
and ABIs to userspace...; the userspace code will grow some more
hair; and there will be zero maintainability gain. And at the same
time you won't be able to leverage optimizations only possible in the
kernel.

>
> I think this is why userland CR implementation makes much more sense.
> Most of states visible to a userland process are rather rigidly
> defined by standards and, ultimately, ABI and the kernel exports most
> of those information to userland one way or the other. Given the
> right set of needed features, most of which are probably already
> implemented, a userland implementation should have access to most
> information necessary to checkpoint without resorting to too messy
> methods and then there inevitably needs to be some workarounds to make
> CR'd processes behave properly w.r.t. other states on the system, so
> userland workarounds are inevitable anyway unless it resorts to

To be precise, there are three types of userland workarounds:

1) userland workarounds to make a restarted application work when
peer processes aren't saved - e.g., in distributed checkpoint you need
a workaround to rebuild the socket to the peer; or in his example
with the 'nscd' daemon from earlier in the thread. These are needed
regardless of the c/r engine of choice. In many cases they can be
avoided if applications are run in containers.
(which can be as
simple as running a program using 'nohup')

2) userland workarounds to duplicate virtualization logic already
done by the kernel - like the userspace pid-namespace and the
complex logic and hacks needed to make it work. This is completely
unnecessary when you do kernel c/r.

3) userland workarounds to compensate for the fact that userspace
can't get or set some state during checkpoint or restart. For
example, in the kernel it's trivial to track shared files. How would
you tell, from userspace, whether fd[0] of parent A and child B is the
same file opened and then inherited, or the same filename opened
twice individually ? For files, it is possible to figure this out in
user space, e.g. by intercepting and tracking all forks and all file
operations (including passing fd's via AF_UNIX sockets). There are
other hairy ways to do it, but not quite so for other resources.

As another example, consider SIDs and PGIDs. With proper algorithms
you can ensure that your processes get the right SID at fork time.
But in the general case, you can't reproduce PGIDs accurately without
replaying how the processes (including those that had died already)
behaved.

And to track zombies at checkpoint, you'd need to actually collect
them, so you must do it in a hairy wrapper, and keep the secret until
the application calls wait(). But then, there may be some side
effects due to collecting zombies, e.g. the pid may be reused against
the application's expectation.

Some of these have workarounds, some not. Do you really think that
reimplementing linux and namespaces in userspace is the way to go ?

Then, you can add to the kernel an endless amount of interfaces to
export all of this - both data, and the functionality to reinstate
this data at restart. But ... wait -- isn't that what linux-cr
already does ?

> preemptive separation using namespaces and containers, which I frankly
> think isn't much of value already and more so going forward.

That is one opinion.
Then there are people using VPSs in commercial
and private environments, for example. VMs are a wonderful
(re)invention. Regardless of any one single person's opinion about
VMs vs containers, both are here to stay, and both have their
use-cases and users. IMHO, it is wrong to ignore the need for c/r and
migration capabilities for containers, whether they run full desktop
environments, multiple applications or single processes.

Oren.
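
To make Oren's shared-file question above concrete, here is a small sketch of one of the "hairy" userspace probes, under stated assumptions. It uses dup() within a single process as a stand-in for a descriptor inherited across fork() (both share one open file description), it only works on seekable files, and the helper names `same_inode`, `share_offset` and `demo` are made up for this example. The probe moves the file offset, so it is intrusive and only safe while the application is stopped.

```python
import os
import tempfile


def same_inode(fd_a, fd_b):
    """True if both descriptors refer to the same file on disk."""
    a, b = os.fstat(fd_a), os.fstat(fd_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def share_offset(fd_a, fd_b):
    """Probe whether two descriptors share one open file description."""
    pos_a = os.lseek(fd_a, 0, os.SEEK_CUR)
    pos_b = os.lseek(fd_b, 0, os.SEEK_CUR)
    if pos_a != pos_b:
        return False  # a shared description always has a single offset
    # Move the offset through fd_a and see whether fd_b follows.
    os.lseek(fd_a, pos_a + 1, os.SEEK_SET)
    followed = os.lseek(fd_b, 0, os.SEEK_CUR) == pos_a + 1
    os.lseek(fd_a, pos_a, os.SEEK_SET)  # restore the original position
    return followed


def demo():
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"checkpoint me")
        tmp.flush()
        first = os.open(tmp.name, os.O_RDONLY)
        shared = os.dup(first)                     # same open file description
        reopened = os.open(tmp.name, os.O_RDONLY)  # independent description
        try:
            return {
                "dup_same_inode": same_inode(first, shared),
                "reopen_same_inode": same_inode(first, reopened),
                "dup_shares_offset": share_offset(first, shared),
                "reopen_shares_offset": share_offset(first, reopened),
            }
        finally:
            for fd in (first, shared, reopened):
                os.close(fd)


if __name__ == "__main__":
    print(demo())
```

Comparing device and inode numbers cannot tell the two cases apart, since both pairs point at the same file. Only the offset probe does, and it has no obvious equivalent for the other resources Oren lists.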
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-05 23:18 ` Oren Laadan @ 2010-11-06 10:13 ` Tejun Heo 0 siblings, 0 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-06 10:13 UTC (permalink / raw) To: Oren Laadan Cc: Gene Cooperman, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch Hello, On 11/06/2010 12:18 AM, Oren Laadan wrote: >> I'm probably missing something but can't you stop the application >> using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry >> about -EINTR failures (there are some exceptions but nothing really to >> worry about). Also, unless the manager thread needs to be always >> online, you can inject manager thread by manipulating the target >> process states while taking a snapshot. > > This is an excellent example to demonstrate several points: > > * To freeze the processes, you can use (quote) "hairy" signal > overload mechanism, or even more hairy ptrace; both by the way have > their performance problem with many processes/threads. Or you can > use the in-kernel freezer-cgroup, and forget about workarounds, like > linux-cr does. And ~200 lines in said diff are dedicated exactly to > that. > > * Then, because both the workaround and the entire philosophy > of MTCP c/r engine is that affected processes _participate_ in > the checkpoint, their syscalls _must_ be interrupted. Contrastly, > linux-cr kernel approach allows not only to checkpoint processes > without collaboration, but also builds on the native signal > handling kernel code to restart the system calls (both after > unfreeze, and after restart), such that the original process > does not observe -EINTR. The above problems can be solved for userland C/R with small self-contained modification to a small part of the kernel. 
You're insisting that because currently some obscure corner cases aren't handled, the whole thing should be shoved in the kernel and the kernel should be serializing and deserializing its internal data structures for everything visible in the userland. That's silly at best. Note the "visible in the userland" part. Most of those parts are already discoverable without further modifications to kernel. The only sane approach would be add missing pieces which would not only benefit CR but other applications too. Also, you said the patches didn't have to change much because the data structures facing userland didn't change much over different kernel versions, which of course is true as it's so close to the userland visible ABI. That is _NOT_ a selling point for kernel CR. That's a BIG GLOWING SIGN telling you that you're on the frigging wrong side of the wall. > BTW, a real security expert (and I'm not one...) may argue that > this operation should only be allowed to privileged users. In fact, > if your code gets around the linux ASLR mechanisms, then someone > should fix the kernel ASLR code :) ASLR is to protect a program from itself not from outside. If you can ptrace a process, ASLR doesn't mean a thing. >> I see. I just thought that it would be helpful to have the core part >> - which does per-process checkpointing and restoring and corresponds >> to the features implemented by in-kernel CR - as a separate thing. It >> already sounds like that is mostly the case. > > FWIW, the restart portion of linux-cr is designed with this in > mind - it is flexible enough to accommodate for smart userspace > tools and wrappers that wish to mock with the processes and > their resource post-restart (but before the processes resume > execution). For example, a distributed checkpoint tool could, > at restart time, reestablish the necessary network connections > (which is much different than live migration of connections, > and clearly not a kernel task). 
This way, it is trivial to migrate > a distributed application from one set of hosts to another, on > different networks, with very little effort. Yeap, that was the reason why I asked how modularized that part of dmtcp was as it would directly compare with the in-kernel implementation. If they can be well separated, I think it would even be possible to switch between the two while keeping the upper set of workarounds the same. >> I don't have much idea about the scope of the whole thing, so please >> feel free to hammer senses into me if I go off track. From what I >> read, it seems like once the target process is stopped, dmtcp is able >> to get most information necessary from kernel via /proc and other >> methods but the paper says that it needs to intercept socket related >> calls to gather enough information to recreate them later. I'm >> curious what's missing from the current /proc. You can map socket to >> inode from /proc/*/fd which can be matched to an entry in >> /proc/*/net/PROTO to find out the addresses and most socket options >> should be readable via getsockopt. Am I missing something? > > So you'll need mechanisms not only to read the data at checkpoint > time but also to reinstate the data at restart time. By the time > you are done, the kernel all the c/r code (the suspect diff in > question _and_ the rest of the logic) in the form of new interfaces > and ABIs to usersapce...; the userspace code will grow some more > hair; and there will be zero maintainability gain. And at the same > you won't be able to leverage optimizations only possible in the > kernel. Unfortunately, for most things which matter, everything is already in place and if you just concentrate on the core part the hackiness seems quite manageable and I think it wouldn't be too difficult to reduce it further. I don't see why userland implementation wouldn't be able to snapshot any random process without LD_PRELOADs or whatever cooperation from it. 
And, if the COW thing is so important, we can collect the information and export it to userland via proc or ringbuffer. That's what qemu-kvm would need anyway, right? I don't think kvm guys would be so crazy as to put the whole snapshotter into the kernel. > To be precise, there are three types of userland workarounds: > > 1) userland workarounds to make a restarted application work when > peer processes aren't saved - e.g., in distributed checkpoint you > need a workaround to rebuild the socket to the peer; or in his > example with the 'nscd' daemon from earlier in the thread. > > These are needed regardless of the c/r engine of choice. In many > cases they can be avoided if applications are run in containers. > (which can be as simple as running a program using 'nohup') > > 2) userland workarounds to duplicate virtualization logic already > done by the kernel - like the userspace pid-namespace and the > complex logic and hacks needed to make it work. This is completely > unnecessary when you do kernel c/r. No, that's not primarily a feature of kernel CR. It's a feature of namespaces and containers. > 3) userland workarounds to compensate for the fact that userspace > can't get or set some state during checkpoint or restart. For > example, in the kernel it's trivial to track shared files. How > would you tell, from userspace, if fd[0] of parent A and child B is > the same file opened and then inherited, or the same filename > opened twice individually? For files, it is possible to figure > this out in user space, e.g. by intercepting and tracking all forks > and all file operations (including passing fd's via AF_UNIX sockets). Or, if it's a regular file, lseek() and see whether the offsets change together, or, even better, just toggle O_NDELAY with fcntl. > There are other hairy ways to do it, but not quite so for other > resources. If you think toggling O_NDELAY is hairy, let's add a noop flag bit or export whatever via /proc/*/fdinfo.
We already have all that stuff for a reason. > As another example, consider SIDs and PGIDs. With proper algorithms > you can ensure that your processes get the right SID at fork time. > But in the general case, you can't reproduce PGIDs accurately > without replaying how the processes (including those that had died > already) behaved. > > And to track zombies at checkpoint, you'd need to actually collect > them, so you must do it in a hairy wrapper, and keep the secret > until the application calls wait(). But then, there may be some > side effects due to collecting zombies, e.g. the pid may be reused > against the application's expectation. > > Some of these have workarounds, some not. Do you really think that > re-implementing linux and namespaces in userspace is the way to go ? No, I think you're blowing corner cases, which are in Siberia-cold paths, way out of proportion. None of the above justifies putting the whole thing in the kernel. Solve each problem with local solutions. You're basically doing the same thing with in-kernel implementation, the only difference being you sidestepping ABI issues by saying that kernel CR format would stay _mostly_ stable and what changes would be dealt with from userland tools. Everything visible from usual userland applications should be (and is for the most part) defined by ABI. And if every state worthy of saving is well defined and visible from userland, there's no reason to do it from kernel. > Then, you can add to the kernel an endless number of interfaces to > export all of this - both data, and the functionality to re-instate > this data at checkpoint. But ... wait -- isn't that what linux-cr > already does ? I hope that's what linux-cr did. It unfortunately serializes and de-serializes in-kernel data structures which are already mostly visible from userland instead of hunting down and improving missing pieces.
>> preemptive separation using namespaces and containers, which I frankly >> think isn't much of value already and more so going forward. > > That is one opinion. Then there are people using VPSs in commercial > and private environments, for example. > > VMs are a wonderful (re)invention. Regardless of any one single > person's opinion about VMs vs containers, both are here to stay, and both > have their use-cases and users. IMHO, it is wrong to ignore the > need for c/r and migration capabilities for containers, whether > they run full desktop environments, multiple applications or single > processes. Sure, I'm not ignoring them. I'm just saying in-kernel CR doesn't make a good trade-off with its limited benefits and extensive complexity all across the kernel, and the reason why its benefits are limited is that it's sandwiched pretty tightly between userland CR and proper virtualization. Moreover, the space in-kernel CR tries to occupy is getting smaller day by day. It just can't justify its complexity. Thanks. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
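An illustrative aside on the O_NDELAY trick suggested in the message above: two descriptors that share one open file description (an fd inherited across fork() or duplicated with dup()) also share their file status flags, while two independent open() calls on the same filename do not. The following is only a minimal Python sketch of that idea, assuming both descriptors are reachable from the checking process; a real checkpointer would have to run the "toggle" half and the "observe" half in the two processes concerned. The function name is made up for illustration, and O_NONBLOCK is used because it has the same value as O_NDELAY on most Linux architectures.

```python
import fcntl
import os


def share_open_file_description(fd_a, fd_b):
    """Return True if fd_a and fd_b refer to the same open file description.

    Hypothetical helper: toggles O_NONBLOCK on fd_a, checks whether the
    change is visible through fd_b, then restores the original flags.
    """
    flags_a = fcntl.fcntl(fd_a, fcntl.F_GETFL)
    flags_b = fcntl.fcntl(fd_b, fcntl.F_GETFL)
    if (flags_a ^ flags_b) & os.O_NONBLOCK:
        # The flag already differs, so the descriptions cannot be shared.
        return False
    fcntl.fcntl(fd_a, fcntl.F_SETFL, flags_a ^ os.O_NONBLOCK)
    try:
        seen_through_b = fcntl.fcntl(fd_b, fcntl.F_GETFL)
        return bool((seen_through_b ^ flags_b) & os.O_NONBLOCK)
    finally:
        # Always put the original flags back on fd_a.
        fcntl.fcntl(fd_a, fcntl.F_SETFL, flags_a)


if __name__ == "__main__":
    import tempfile

    with tempfile.NamedTemporaryFile() as tmp:
        first = os.open(tmp.name, os.O_RDONLY)
        inherited = os.dup(first)                 # same open file description
        reopened = os.open(tmp.name, os.O_RDONLY)  # separate description
        print(share_open_file_description(first, inherited))
        print(share_open_file_description(first, reopened))
        for fd in (first, inherited, reopened):
            os.close(fd)
```

The lseek() variant mentioned in the same message would work along the same lines for regular files, but it disturbs the file offset, which is why the flag toggle is the gentler probe.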
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-05 9:28 ` Tejun Heo 2010-11-05 23:18 ` Oren Laadan @ 2010-11-06 0:36 ` Kapil Arya 2010-11-06 22:55 ` Oren Laadan 2010-11-17 10:45 ` Tejun Heo 2010-11-06 5:32 ` Matt Helsley 2 siblings, 2 replies; 111+ messages in thread From: Kapil Arya @ 2010-11-06 0:36 UTC (permalink / raw) To: Tejun Heo Cc: Gene Cooperman, Oren Laadan, ksummit-2010-discuss, linux-kernel, hch > I'm probably missing something but can't you stop the application > using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry > about -EINTR failures (there are some exceptions but nothing really to > worry about). Also, unless the manager thread needs to be always > online, you can inject manager thread by manipulating the target > process states while taking a snapshot. In fact CryoPid uses exactly the same approach and has been around for around 5 years. Not as much development effort has gone into CryoPid as DMTCP and so its application coverage is not as broad. But the larger issue for using PTRACE is that you can not have two superiors tracing the same inferior process. So if you want to checkpoint a gdb session or valgrind or tmux or strace, then you can not directly control and quiesce the inferior process being traced. Beyond that, we also have a vision (not yet implemented) of process virtualization by which one can change the behavior of a program. For example, if a distributed computation runs over infiniband, can we migrate to a TCP/IP cluster. For this, one needs the flexibility of wrappers around system calls. This vision of process virtualization also motivates why our own research project has steered away from in-kernel C/R. > > But since you ask :-), there is one thing on our wish list. We > > handle address space randomization, vdso, vsyscall, and so on quite > > well. We do not turn off address space randomization (although on > > restart, we map user segments back to their original addresses). 
> > Probably the randomized value of brk (end-of-data or end of heap) is > > the thing that gave us the most troubles and that's where the code > > is the most hairy. > > Can you please elaborate a bit? What do you want to see changed? Yes, we would love to elaborate :-). We began DMTCP with Linux kernel 2.6.3. When Address Space Layout Randomization was added, we were forced to add some hacks concerning VDSO location and end-of-data. End-of-data is the uglier part. On restart, we directly map each memory segment at the address it had at checkpoint time. The issue comes in mapping the heap back to its original location. We call sbrk() to reset the end-of-data to the end of the original heap. This fails if the randomized beginning-of-data/end-of-data given to us by the kernel for the restarted process is too far away from where we want to remap the heap. To get around this, we play games with legacy layout, other personality parameters, and RLIMIT_STACK (since the kernel uses RLIMIT_STACK in choosing the appropriate memory layout). For our wish list, we would like a way of telling the kernel where to set beginning-of-data/end-of-data. Curiously enough, at the time at which Linux started randomizing address space, there was discussion of offering exactly this facility for the sake of legacy programs, but it turned out not to be needed. Similarly, it would be nice to tell the kernel where we want the VDSO page. Currently, we get around this by keeping two VDSO pages, the old one which we restore and the new one specified to us by the kernel when the restart process is created. This works well for us, and so controlling the address of the VDSO page is less important for us.
From what I > read, it seems like once the target process is stopped, dmtcp is able > to get most information necessary from kernel via /proc and other > methods but the paper says that it needs to intercept socket related > calls to gather enough information to recreate them later. I'm > curious what's missing from the current /proc. You can map socket to > inode from /proc/*/fd which can be matched to an entry in > /proc/*/net/PROTO to find out the addresses and most socket options > should be readable via getsockopt. Am I missing something? The design of DMTCP was decided upon roughly during the period from Linux 2.6.3 through Linux 2.6.18. At that time, /proc/*/net did not exist. You are right that this can provide a much better design for DMTCP and eliminate some of our wrappers. Thanks very much for pointing this out. We are now eager to implement a new design based on /proc/*/net in the near future. Since /proc/*/net provides a simpler design for sockets, we started wondering what other simplifications may be possible. Here is one possibility: in the case of shared file descriptors, DMTCP goes through two barriers in order to decide which process will be responsible for checkpointing which shared file descriptor. It works and the overhead is reasonable, but if you have additional suggestions for this case, we would be very interested. > I think this is why userland CR implementation makes much more sense. > Most of the states visible to a userland process are rather rigidly > defined by standards and, ultimately, ABI and the kernel exports most > of that information to userland one way or the other. Given the > right set of needed features, most of which are probably already > implemented, a userland implementation should have access to most > information necessary to checkpoint without resorting to too messy > methods and then there inevitably needs to be some workarounds to make > CR'd processes behave properly w.r.t.
other states on the system, so > userland workarounds are inevitable anyway unless it resorts to > preemptive separation using namespaces and containers, which I frankly > think isn't much of value already and more so going forward. It's a very good point and we agree completely. Here are some examples where we believe a userland component is inevitable even if one begins with in-kernel C/R: 1. NSCD daemon -- in calls to libc::gethostname() etc. libc arranges for communication by sharing a memory segment with the application process. Our code recognizes this shared memory because it starts with /var/*/nscd. 2. syslogd -- Applications using syslog have a socket open to the syslog daemon. DMTCP makes a system call to turn off logging at checkpoint time. 3. X-windows terminals -- xterm/gnome-terminal/konsole all emulate ANSI terminals. They support various ANSI features such as setting up a scrolling region above the status line. GNU screen uses the scrolling region feature. On restart, we have to convince GNU screen and similar programs to re-initialize their ANSI terminal. We do this successfully by sending a SIGWINCH on restart, since the application has to re-initialize the ANSI terminal whenever the window size changes. In fact we send one SIGWINCH and when the application calls ioctl() to get the window size, we lie and say that the window size changed, and we then send another SIGWINCH from within the wrapper to force the application to recheck the window size and discover that the window is back to its original size. 4. X11 apps -- The current approach to checkpointing X-windows applications is to checkpoint them within a VNC server. We plan to add wrappers around calls to libX11.so so that we can discover the state of an X11 window at checkpoint time and then restart just the single X11 application. This avoids the need to also checkpoint the X11 server, which minimizes the size of the checkpoint image that needs to be written to the disk. 5.
GNU Screen -- DMTCP sets SCREEN_DIR to a temp directory in order to avoid the issue that occurs when the setuid screen process tries to access /var/run/uscreen. Otherwise we would have difficulty at restart time when the checkpoint image has no setuid privilege. We don't know if there are similar issues with an in-kernel C/R. We really enjoyed this discussion. If you are interested, we would be happy to talk further by phone in order to take advantage of the higher bandwidth. Best, -Gene and Kapil ^ permalink raw reply [flat|nested] 111+ messages in thread
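As an illustration of the /proc-based socket discovery discussed in the message above (map each fd to a socket inode via /proc/*/fd, then match the inode against /proc/*/net/PROTO), here is a minimal Python sketch for IPv4 TCP only. It is not DMTCP code; all function names are hypothetical, the column positions follow the usual /proc/net/tcp layout, and socket options would still have to be read separately with getsockopt().

```python
import os
import re
import socket
import sys

SOCKET_LINK = re.compile(r"^socket:\[(\d+)\]$")


def socket_inode_from_link(link):
    """Return the inode from a /proc/PID/fd link like 'socket:[12345]', else None."""
    match = SOCKET_LINK.match(link)
    return int(match.group(1)) if match else None


def decode_ipv4_endpoint(field, byteorder=sys.byteorder):
    """Decode a 'HEXADDR:HEXPORT' field as printed in /proc/net/tcp."""
    addr_hex, port_hex = field.split(":")
    # The kernel prints the in-memory 32-bit address as a native integer,
    # so converting back with the host byte order yields network-order bytes.
    raw = int(addr_hex, 16).to_bytes(4, byteorder)
    return socket.inet_ntoa(raw), int(port_hex, 16)


def parse_proc_net_tcp(text, byteorder=sys.byteorder):
    """Map socket inode -> (local endpoint, remote endpoint, state)."""
    table = {}
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 10:
            continue
        local = decode_ipv4_endpoint(fields[1], byteorder)
        remote = decode_ipv4_endpoint(fields[2], byteorder)
        table[int(fields[9])] = (local, remote, int(fields[3], 16))
    return table


def tcp_sockets_of(pid):
    """Yield (fd, local, remote, state) for the IPv4 TCP sockets of a process."""
    with open(f"/proc/{pid}/net/tcp") as handle:
        table = parse_proc_net_tcp(handle.read())
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            link = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were looking
        inode = socket_inode_from_link(link)
        if inode in table:
            yield (int(name),) + table[inode]


if __name__ == "__main__":
    for entry in tcp_sockets_of(os.getpid()):
        print(entry)
```

Reading another user's /proc/PID/fd needs the appropriate privileges, and a complete tool would repeat the same matching for tcp6, udp, and unix.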
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-06 0:36 ` Kapil Arya @ 2010-11-06 22:55 ` Oren Laadan 2010-11-07 19:42 ` Gene Cooperman 2010-11-17 10:45 ` Tejun Heo 1 sibling, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-06 22:55 UTC (permalink / raw) To: Kapil Arya Cc: Tejun Heo, Gene Cooperman, ksummit-2010-discuss, linux-kernel, hch On 11/05/2010 08:36 PM, Kapil Arya wrote: >> I'm probably missing something but can't you stop the application >> using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry >> about -EINTR failures (there are some exceptions but nothing really to >> worry about). Also, unless the manager thread needs to be always >> online, you can inject manager thread by manipulating the target >> process states while taking a snapshot. > > In fact CryoPid uses exactly the same approach and has been around for around 5 > years. Not as much development effort has gone into CryoPid as DMTCP and so its > application coverage is not as broad. But the larger issue for using PTRACE is > that you can not have two superiors tracing the same inferior process. So if you > want to checkpoint a gdb session or valgrind or tmux or strace, then you can not > directly control and quiesce the inferior process being traced. > > Beyond that, we also have a vision (not yet implemented) of process > virtualization by which one can change the behavior of a program. For example, > if a distributed computation runs over infiniband, can we migrate to a TCP/IP > cluster. For this, one needs the flexibility of wrappers around system calls. > This vision of process virtualization also motivates why our own research > project has steered away from in-kernel C/R. This is a very useful vision. However, it is unrelated to how you do c/r, but rather to what you do after you restart and before you let the application resume execution. For example, in your example, you'd need to wrap the library calls (e.g. 
of the MPI implementation) and replace them to use TCP/IP or infiniband. Wrapping on system calls won't help you. Or you could just replace the resource - e.g., make the restarted application use a socket for stdout instead of the tty, so you can redirect the output to wherever. Both methods are orthogonal to the c/r itself: linux-cr will allow you to replace/modify resources if you so wish, and I suspect that MTCP also can/will. Interposing on library calls is possible with MTCP methods, or using binary instrumentation, or PIN, or DynInst, or LD_PRELOAD. There are only two reasons to interpose on system calls, as I noted in an earlier message (http://lkml.org/lkml/2010/11/5/262 - see points "2)" and "3)" about userland-workarounds): One - to virtualize in userspace resources (e.g. pids) that the kernel already knows how to virtualize. Two - to track state of resources during execution and lie about their state when needed, because userspace can't cleanly save and restore their state. Virtualization through interposition is extremely tricky in and out of the kernel. The examples given throughout this thread (by either side) expose the tip of the iceberg. Interposition as a technique is full of security and other pitfalls, as discussed by extensive literature in the area (which I cited in another email). So I'll repeat the question I asked there: is re-implementing chunks of kernel functionality and all namespaces in userspace the way to go ? > >>> But since you ask :-), there is one thing on our wish list. We >>> handle address space randomization, vdso, vsyscall, and so on quite >>> well. We do not turn off address space randomization (although on >>> restart, we map user segments back to their original addresses). >>> Probably the randomized value of brk (end-of-data or end of heap) is >>> the thing that gave us the most troubles and that's where the code >>> is the most hairy.
>> [snip] > The design of DMTCP was decided upon roughly during the period from Linux 2.6.3 > through Linux 2.6.18. At that time, /proc/*/net did not exist. You are right > that this can provide much better design for DMTCP and eliminate some of our > wrappers. Thanks very much for pointing this out. We are now egar to implement a > new design based on /proc/*/net in the near future. > > Since /proc/*/net provides a simpler design for sockets, we started wondering > what other simplifications may be possible. Here is one possibility, in the case > of shared file descriptors, DMTCP goes through two barriers in order to decide > which process will be responsible for checkpointing which shared-file > descriptor. It works and the overhead is reasonable, but if you have additional > suggestion for this case, we would be very interested. What is "reasonable" overhead ? For which applications ? What about a 'kernel make' ? What about servers (db, web, etc) ? What about VPSs/VDIs ? Can we do better, including for HPC ? ... > >> I think this is why userland CR implementation makes much more sense. >> Most of states visible to a userland process are rather rigidly >> defined by standards and, ultimately, ABI and the kernel exports most >> of those information to userland one way or the other. Given the >> right set of needed features, most of which are probabaly already >> implemented, a userland implementation should have access to most >> information necessary to checkpoint without resorting to too messy >> methods and then there inevitably needs to be some workarounds to make >> CR'd processes behave properly w.r.t. other states on the system, so >> userland workarounds are inevitable anyway unless it resorts to >> preemtive separation using namespaces and containers, which I frankly >> think isn't much of value already and more so going forward. > > Its a very good point and we agree completely. 
Here are some examples where we > believe, a userland component is inevitable even if one begins with in-kernel > C/R: Exactly ! Wrapping around apps to isolate them from the environment is desirable, regardless of how you technically c/r the apps, when you want to be able to c/r apps outside their native environment. Generally, you can either include the environment in the checkpoint, or provide wrappers to virtualize it after restart, or modify the app so that it knows how to adapt to new environments after restart. Either way, you need to technically c/r the app, no matter how much userspace trickery you may choose to apply afterwards if needed. And doing so in-kernel is more transparent (yes, transparent means that it does not require LD_PRELOAD or collaboration of the application! nor does it require userspace virtualizations of so many things already provided by the kernel today), more generic, more flexible, provides more guarantees, cover more types or states of resources, and can perform significantly better. And then, if you want to work with dmtcp's type of scenarios, you could use the generic c/r and apply their wrappers on top of it ! [snip] Thanks, Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
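The resource replacement mentioned in the message above (make the restarted application use a socket for stdout instead of the tty) comes down, at the descriptor level, to dup2(). The following toy Python sketch only illustrates that mechanism; it is neither linux-cr nor DMTCP code, the helper name is hypothetical, and a pipe stands in for the tty so the example leaves the real stdout alone.

```python
import os
import socket


def replace_fd_with_socket(target_fd, sock):
    """Make target_fd refer to the socket's open file description."""
    # dup2() closes whatever target_fd referred to and points it at the
    # socket; the process keeps using the same descriptor number.
    os.dup2(sock.fileno(), target_fd)


if __name__ == "__main__":
    read_end, write_end = os.pipe()   # stand-in for the original tty
    near, far = socket.socketpair()   # 'far' plays the output collector
    replace_fd_with_socket(write_end, near)
    os.write(write_end, b"output after restart\n")
    print(far.recv(64))
```

Because the application only ever sees the descriptor number, it does not need to know that the object behind it changed, which is why this step is independent of how the checkpoint itself was taken.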
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-06 22:55 ` Oren Laadan @ 2010-11-07 19:42 ` Gene Cooperman 2010-11-07 21:30 ` Oren Laadan 0 siblings, 1 reply; 111+ messages in thread From: Gene Cooperman @ 2010-11-07 19:42 UTC (permalink / raw) To: Oren Laadan Cc: Kapil Arya, Tejun Heo, Gene Cooperman, ksummit-2010-discuss, linux-kernel, hch I'd like to add a few clarifications, below, about DMTCP concerning Oren's comments. I'd also like to point out that we've had about 100 downloads per month from sourceforge (and some interesting use cases from end users) over the last year (although the sourceforge numbers do go up and down :-) ). In general, I think we'll all understand the situation better after having had the opportunity to talk offline. Below are some clarifications about DMTCP. === > For example, in your example, you'd need to wrap the library calls > (e.g. of MPI implementation) and replace them to use TCP/IP or > infiniband. Wrapping on system calls won't help you. We do not put any wrappers around MPI library calls. MPI calls things like open, close, connect, listen, execve({"ssh", ...}, ...), etc. At this time, DMTCP adds wrappers _only_ around calls to libc.so and libpthread.so . This is sufficient to checkpoint a distributed computation like MPI. > The only two reasons to interpose on system calls, ... > > One - to virtualize in userspace resources (e.g. pids) that the > kernel already knows how to virtualize. > > Two - to track state of resources during execution and lie about > their state when needed, because userspace can't cleanly save > and restore their state. Just a small correction about interposition. The primary "Reason Two" for interposing on system calls should be to _spy_ on what the user process is doing and save that information. For the most part, we do not _lie about their state when needed_.
I agree that virtualization of pids is an exception where we have to lie, but that was already stated as "Reason One" above. At restart time, we may also recreate resources that are no longer in the kernel. But this is not an example of interposition. I suppose that it is an example of lying, but every C/R technique will need to do this. Later, perhaps Oren, Kapil and I can browse the DMTCP code together, and we can look exactly at what each wrapper is doing. The system call wrappers are, in fact, the smaller part of the DMTCP code. It's about 3000 lines of code. For anybody who is curious about what our wrappers do, please download the DMTCP source code, and look at .../dmtcp/src/*wrapper*.cpp . > So I'll repeat the question I asked there: is re-reimplementing > chunks of kernel functionality and all namespaces in userspace > the way to go ? If you're referring to interposition here, that takes place essentially in the wrappers, and the wrappers are only 3000 lines of code in DMTCP. Also, I don't believe that we're "re-implementing chunks of kernel functionality", but let's continue that discussion offline. > What is "reasonable" overhead ? > For which applications ? > What about a 'kernel make' ? > What about servers (db, web, etc) ? > What about VPSs/VDIs ? > Can we do better, including for HPC ? Again, all good questions that will be answered more easily offline. > ... (yes, transparent means that > it does not require LD_PRELOAD or collaboration of the application! > nor does it require userspace virtualizations of so many things > already provided by the kernel today), more generic, more flexible, > provides more guarantees, cover more types or states of resources, > and can perform significantly better. I still haven't understood why you object to the DMTCP use of LD_PRELOAD. How will the user app ever know that we used LD_PRELOAD, since we remove LD_PRELOAD from the environment before the user app libraries and main can begin? 
And, if you really object to LD_PRELOAD, then there are other ways to capture control. Similarly, I'll have to understand better what you mean by the _collaboration of the application_. DMTCP operates on unmodified application binaries. Basically, if _transparent_ means that one is not allowed to use anything at all from userland, then I agree with you that no userland checkpointing can ever be transparent. But, I think that's a biased definition of _transparent_. :-) > And then, if you want to work with dmtcp's type of scenarios, you > could use the generic c/r and apply their wrappers on top of it ! Agreed. As before, I'm looking forward to us analyzing all the use cases offline. I think that we're all (myself included) in the situation of the three blind men and the elephant. I think part of the misunderstanding is that we're each thinking about a different use case, and so we (myself included) end up comparing apples and oranges. Thanks, - Gene ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-07 19:42 ` Gene Cooperman @ 2010-11-07 21:30 ` Oren Laadan 2010-11-07 23:05 ` Gene Cooperman 0 siblings, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-07 21:30 UTC (permalink / raw) To: Gene Cooperman Cc: Kapil Arya, Tejun Heo, ksummit-2010-discuss, linux-kernel, hch On 11/07/2010 02:42 PM, Gene Cooperman wrote: > I'd like to add a few clafifications, below, about DMTCP concerning > Oren's comments. I'd also like to point out that we've had about 100 > downloads per month from sourceforge (and some interesting use cases > from end users) over the last year (although the sourceforge numbers > do go up and down :-) ). In general, I think we'll all understand the > situation better after having had the opportunity to talk offline. > Below are some clarifications about DMTCP. > === > >> For example, in your example, you'd need to wrap the library calls >> (e.g. of MPI implementation) and replaced them to use TCP/IP or >> infiniband. Wrapping on system calls won't help you. > > We do not put any wrappers around MPI library calls. MPI calls things > like open, close, connect, listen, execve({"ssh", ...}, ...), etc. > At this time, DMTCP adds wrappers _only_ around calls to libc.so > and libpthread.so . This is sufficient to checkpoint a distributed > computation like MPI. Of course. And you don't need syscall virtualization for this. Zap did it already many years ago :) Only problem with the above is that, conveniently enough, you _left out_ the context: >> For example, >> if a distributed computation runs over infiniband, can we migrate to a TCP/IP >> cluster. For this, one needs the flexibility of wrappers around system calls. Do you also support checkpoint a distributed app that uses an infiniband MPI stack and restart it with a TCP based MPI stack ? Can you do it with only syscall wrapping and without knowledge on the MPI implementation and some MPI-specific logic in the wrappers ? 
I'm curious how you do that without wrapping around MPI calls, or without an c/r-aware implementation of MPI. Again, this is unrelated to how you do the core c/r work. I think we both agree that _this_ kind of app-wrappers/app-awareness is useful for certain uses of c/r. [snip] >> So I'll repeat the question I asked there: is re-reimplementing >> chunks of kernel functionality and all namespaces in userspace >> the way to go ? > > If you're referring to interposition here, that takes place essentially > in the wrappers, and the wrappers are only 3000 lines of code in DMTCP. > Also, I don't believe that we're "re-implementing chunks of kernel > functionality", but let's continue that discussion offline. The interposition itself is relatively simple (though not atomic). The problem is the logic to "spy" on and "lie" to the applications. Examples: saving ptrace state, saving FD_CLOEXEC flag, correctly maintaining a userspace pid-ns, etc. [...] > >> ... (yes, transparent means that >> it does not require LD_PRELOAD or collaboration of the application! >> nor does it require userspace virtualizations of so many things >> already provided by the kernel today), more generic, more flexible, >> provides more guarantees, cover more types or states of resources, >> and can perform significantly better. > > I still haven't understood why you object to the DMTCP use of LD_PRELOAD. > How will the user app ever know that we used LD_PRELOAD, since we remove > LD_PRELOAD from the environment before the user app libraries and main > can begin? And, if you really object to LD_PRELOAD, then there are > other ways to capture control. Similarly, I'll have to understand better I don't object to it per se - it's actually pretty useful oftentimes. But in our context, it has limitations. For example, it does not cover static applications, nor apps that call syscalls directly using int 0x80. 
Also, it conflicts with LD_PRELOAD possibly needed for other software (like valgrind) - for which again you would need yet another per-app wrapper, at the very least. > what you mean by the _collaboration of the application_. DMTCP operates > on unmodified application binaries. I mean that the application needs to be scheduled and to run in order to participate in its own checkpoint. You use syscall interposition and signal games to do exactly that - gain control over the app and run your library's code. This has at least three negatives: first, some apps don't want to or can't run - e.g. ptraced, or swapped (think incremental checkpoint: why swap everything in ?!); second, the coordination can take significant time, especially if many tasks/threads and resources are involved; third, it modifies the state of the app - if something goes wrong while you use c/r to migrate an app, you impact the app. ('ptrace' relieves you of the need for "collaboration" of processes, but it doesn't address the other problems and adds its own issues.) > Basically, if _transparent_ means > that one is not allowed to use anything at all from userland, then I > agree with you that no userland checkpointing can ever be transparent. > But, I think that's a biased definition of _transparent_. :-) "Transparent" c/r means "invisible" to the user/apps, i.e. that you don't restrict the user or the app in what they do and how they do it. Did you ever try to 'ltrace skype'? There exists useful and popular software that doesn't like being spied on... Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-07 21:30 ` Oren Laadan @ 2010-11-07 23:05 ` Gene Cooperman 2010-11-08 3:55 ` Oren Laadan 0 siblings, 1 reply; 111+ messages in thread From: Gene Cooperman @ 2010-11-07 23:05 UTC (permalink / raw) To: Oren Laadan Cc: Gene Cooperman, Kapil Arya, Tejun Heo, ksummit-2010-discuss, linux-kernel, hch On Sun, Nov 07, 2010 at 04:30:19PM -0500, Oren Laadan wrote: > > > On 11/07/2010 02:42 PM, Gene Cooperman wrote: > >I'd like to add a few clafifications, below, about DMTCP concerning > >Oren's comments. I'd also like to point out that we've had about 100 > >downloads per month from sourceforge (and some interesting use cases > >from end users) over the last year (although the sourceforge numbers > >do go up and down :-) ). In general, I think we'll all understand the > >situation better after having had the opportunity to talk offline. > >Below are some clarifications about DMTCP. > >=== > > > >>For example, in your example, you'd need to wrap the library calls > >>(e.g. of MPI implementation) and replaced them to use TCP/IP or > >>infiniband. Wrapping on system calls won't help you. > > > >We do not put any wrappers around MPI library calls. MPI calls things > >like open, close, connect, listen, execve({"ssh", ...}, ...), etc. > >At this time, DMTCP adds wrappers _only_ around calls to libc.so > >and libpthread.so . This is sufficient to checkpoint a distributed > >computation like MPI. > > Of course. And you don't need syscall virtualization for this. > Zap did it already many years ago :) Only problem with the above > is that, conveniently enough, you _left out_ the context: > > >> For example, > >> if a distributed computation runs over infiniband, can we migrate > to a TCP/IP > >> cluster. For this, one needs the flexibility of wrappers around > system calls. > > Do you also support checkpoint a distributed app that uses an > infiniband MPI stack and restart it with a TCP based MPI stack ? 
> Can you do it with only syscall wrapping and without knowledge > on the MPI implementation and some MPI-specific logic in the > wrappers ? I'm curious how you do that without wrapping around > MPI calls, or without an c/r-aware implementation of MPI. > ... Yes, that's exactly what we plan to do. And we have begun some of the initial work. And yes, we plan to do it without any MPI-specific logic. When we talk to each other offline, I'd be happy to give you more details of how we do it now for TCP "without wrapping around MPI calls, or without an c/r-aware implementation of MPI", and how we are working on extending that to Infiniband. > [snip] > > >>So I'll repeat the question I asked there: is re-reimplementing > >>chunks of kernel functionality and all namespaces in userspace > >>the way to go ? > > > >If you're referring to interposition here, that takes place essentially > >in the wrappers, and the wrappers are only 3000 lines of code in DMTCP. > >Also, I don't believe that we're "re-implementing chunks of kernel > >functionality", but let's continue that discussion offline. > > The interposition itself is relatively simple (though not atomic). > The problem is the logic to "spy" on and "lie" to the applications. > Examples: saving ptrace state, saving FD_CLOEXEC flag, correctly > maintaining a userspace pid-ns, etc. And let's wait for the offline discussion for that --- and we'll describe in detail at that time how we do each one of the things that you mention. It will be easier to discuss each of the things that you mention by looking at the DMTCP code "side-by-side" over the phone. We hope to show you that the logic is really not so complex. > > > >>... (yes, transparent means that > >>it does not require LD_PRELOAD or collaboration of the application! 
> >>nor does it require userspace virtualizations of so many things > >>already provided by the kernel today), more generic, more flexible, > >>provides more guarantees, cover more types or states of resources, > >>and can perform significantly better. > > > >I still haven't understood why you object to the DMTCP use of LD_PRELOAD. > >How will the user app ever know that we used LD_PRELOAD, since we remove > >LD_PRELOAD from the environment before the user app libraries and main > >can begin? And, if you really object to LD_PRELOAD, then there are > >other ways to capture control. Similarly, I'll have to understand better > > I don't object to it per se - it's actually pretty useful oftentimes. > But in our context, it has limitations. For example, it does not > cover static applications, nor apps that call syscalls directly > using int 0x80. For static apps, we would use other interposition techniques. And yes, we haven't implemented support of static apps so far, because our user base hasn't asked for it. We do handle apps that use the syscall system call to make system calls. We don't handle apps that directly use "int 0x80". Again, there are ways to do this, but our user base hasn't asked for it. In general, please keep in mind the principles that you rightly had to remind me of in a previous post. :-) Our two pieces of work are coming from two different directions with two different visions. Linux C/R wants to be so transparent that no user app can ever detect it. DMTCP wants to be transparent enough that any reasonable use case is covered. In particular, DMTCP considers distributed computations to be equally valid use cases for the core DMTCP C/R. I also agree that Linux C/R can be extended to cover distributed apps -- either through userland extensions, or maybe with techniques like in your excellent CLUSTER-2005 paper. Hence, DMTCP has grown its coverage of apps over the years. 
When we talk offline, let's talk about future use cases, and whether there are or are not showstoppers for a userland approach. > Also, it conflicts with LD_PRELOAD possibly needed > for other software (like valgrind) - for which again you would need > yet another per-app wrapper, at the very least. DMTCP does not conflict with the fact that valgrind uses LD_PRELOAD. We add dmtcphijack.so to the beginning of LD_PRELOAD before the user app starts. We then remove it before the app really starts. The LD_PRELOAD requests of valgrind continue to be honored. It all works. > >what you mean by the _collaboration of the application_. DMTCP operates > >on unmodified application binaries. > > I mean that the applications needs to be scheduled and to run to > participate in its own checkpoint. You use syscall interposition > and signals games to do exactly that - gain control over the app > and run your library's code. This has at least three negatives: > first, some apps don't want to or can't run - e.g. ptraced, or > swapped (think incremental checkpoint: why swap everything in ?!); > Second, the coordination can take significant time, especially if > many tasks/threads and resources are involved; Third, it modifies > the state of the app - if something goes wrong while you use c/r > to migrate an app, you impact the app. > > (While 'ptrace' relieves you from the need for "collaboration" > of processes, but doesn't address the other problems and adds > its own issues). Again, I'll add some clarification, although this will best be done offline. DMTCP does indeed do interposition of the 'syscall' system call in glibc. As for signals, we don't really play any signal games. The sole use of signals in DMTCP is for the checkpoint thread of a process to quiesce the user threads of that same process. We use one reserved signal, and we use it solely internally within a single process. If the user app will allow us to use a single signal (e.g.
SIGRTMIN+2), then we don't need any games or interposition at all. We were worried about apps that wish to set _every_ signal to SIG_IGN, etc. Next, let's consider what you say about wrappers around wrappers, and your valgrind example. Also, I'd like to make clear that we've tested primarily on gdb. If it's important, we could do a quick test on valgrind and report back. Our user base hasn't requested support for valgrind so far. Assuming that valgrind does use wrappers, we have a valgrind wrapper around a DMTCP wrapper around a glibc call, which itself is really a wrapper around a kernel API call. If it helps, then think of a wrapper as just another function that calls an inner function. Object-oriented programming uses this principle all the time. Similarly, the glibc wrapper around a kernel API is just one more of these functions. Another way to view this is through the idea of layers. Each layer of the software receives a call from the layer above and may call into the next layer below. As you're already aware, this is a basic principle of O/S design, and so the O/S is full of wrappers. We're just inserting one more layer --- this time between the user app and the glibc layer. I still don't fully understand what you mean by "collaboration", but it sounds like your definition reduces to the use of system call wrappers. In that case, I agree that if DMTCP were not allowed to use system call wrappers, then DMTCP would fall apart. Aside from that near-tautology, I don't understand why system call wrappers are inherently bad. Glibc puts system call wrappers around almost every kernel system call. Glibc even reserves two signals solely for its own use. By the way, for those who wish to inspect the DMTCP wrappers, I'd like to add to my pointers to DMTCP wrappers.
The relevant DMTCP code is in:
dmtcp/src/execwrappers.cpp
dmtcp/src/miscwrappers.cpp
dmtcp/src/pidwrappers.cpp
dmtcp/src/signalwrappers.cpp
dmtcp/src/socketwrappers.cpp
dmtcp/src/syscallsreal.c
dmtcp/src/syscallwrappers.h
dmtcp/src/uniquepid.cpp
dmtcp/src/virtualpidtable.cpp
The total line count is probably 4,500 lines of code, which includes about 500 lines of copyright statement (LGPL), #include and other boring boilerplate. I apologize for the shorter listing in my earlier post. I didn't intend to mislead. There's lots of other DMTCP code concerned with what to do at the time of checkpoint and restart, but that would be a different story. > >Basically, if _transparent_ means > >that one is not allowed to use anything at all from userland, then I > >agree with you that no userland checkpointing can ever be transparent. > >But, I think that's a biased definition of _transparent_. :-) > > "Transparent" c/r means "invisible" to the user/apps, i.e. that > you don't restrict the user or the app in what they do and how > they do it. > > Did you ever try to 'ltrace skype' ? there exists useful and > popular software that doesn't like being spied after... We have not tried to 'ltrace skype'. But ltrace is using PTRACE. Note that DMTCP does not use PTRACE. I imagine the more interesting question is if we ever tried 'dmtcp_checkpoint skype'. No, we have not, but it sounds like an interesting experiment. We'd love to do it, and discuss with you whatever we learn. In the offline discussion, perhaps we can take a shortcut and have you describe the skype tricks to us, so that we can give you a quick first guess. Anyway, there's one other obvious issue with skype for both Linux C/R and DMTCP. Skype is talking to a remote app that is probably not under checkpoint control. And even if both ends are under checkpoint control, Skype is probably not a good use case for C/R, but if it were, it might indeed be a difficult problem. (I'd have to think about it.)
As before, remember that we are talking about two different approaches: - in-kernel C/R and capturing every possible application; - userland C/R and covering the actual use cases that one finds in practice You seem to be arguing that there is an important use case that a DMTCP userland approach can never cover. You may be right about such a use case, but that detailed back-and-forth will be easier to do offline; and then we can summarize for the list. We'll even _help you_ look for those difficult use cases. If they're there, we want to know about them, too. :-) Thanks and best wishes, - Gene ^ permalink raw reply [flat|nested] 111+ messages in thread
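For readers following the LD_PRELOAD exchange in the message above, here is a minimal Python sketch of the environment handling as it is described there: the checkpointing library is put at the front of LD_PRELOAD before launch and taken out again before the application proper starts, while any other entries (such as one belonging to valgrind) are left alone. This is only an illustration of that description, not DMTCP's actual code. The helper names and the entry `other_tool.so` are hypothetical; only `dmtcphijack.so` comes from the message.

```python
HIJACK_LIB = "dmtcphijack.so"  # library name as given in the message above


def split_preload(value):
    """Split an LD_PRELOAD value; entries may be separated by colons or spaces."""
    return [entry for entry in value.replace(":", " ").split() if entry]


def prepend_preload(env, lib=HIJACK_LIB):
    """Return a copy of env with lib placed first in LD_PRELOAD (hypothetical helper)."""
    new_env = dict(env)
    others = [e for e in split_preload(env.get("LD_PRELOAD", "")) if e != lib]
    new_env["LD_PRELOAD"] = ":".join([lib] + others)
    return new_env


def strip_preload(env, lib=HIJACK_LIB):
    """Return a copy of env with lib removed; other entries keep their order."""
    new_env = dict(env)
    remaining = [e for e in split_preload(env.get("LD_PRELOAD", "")) if e != lib]
    if remaining:
        new_env["LD_PRELOAD"] = ":".join(remaining)
    else:
        # Nothing else was preloaded, so the variable disappears entirely.
        new_env.pop("LD_PRELOAD", None)
    return new_env


# Assumed scenario: another tool already asked for its own preload library.
user_env = {"LD_PRELOAD": "other_tool.so"}
launch_env = prepend_preload(user_env)
app_env = strip_preload(launch_env)
```

In this sketch the application ends up seeing the same LD_PRELOAD value the user originally set, which is the property the message claims for the valgrind case.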
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-07 23:05 ` Gene Cooperman @ 2010-11-08 3:55 ` Oren Laadan 2010-11-08 16:26 ` Gene Cooperman 0 siblings, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-08 3:55 UTC (permalink / raw) To: Gene Cooperman Cc: Kapil Arya, Tejun Heo, ksummit-2010-discuss, linux-kernel, hch On 11/07/2010 06:05 PM, Gene Cooperman wrote: [snip] >>>> ... (yes, transparent means that >>>> it does not require LD_PRELOAD or collaboration of the application! >>>> nor does it require userspace virtualizations of so many things >>>> already provided by the kernel today), more generic, more flexible, >>>> provides more guarantees, cover more types or states of resources, >>>> and can perform significantly better. >>> >>> I still haven't understood why you object to the DMTCP use of LD_PRELOAD. >>> How will the user app ever know that we used LD_PRELOAD, since we remove >>> LD_PRELOAD from the environment before the user app libraries and main >>> can begin? And, if you really object to LD_PRELOAD, then there are >>> other ways to capture control. Similarly, I'll have to understand better >> >> I don't object to it per se - it's actually pretty useful oftentimes. >> But in our context, it has limitations. For example, it does not >> cover static applications, nor apps that call syscalls directly >> using int 0x80. > > For static apps, we would use other interposition techniques. And yes, > we haven't implemented support of static apps so far, because our > user base hasn't asked for it. We do handle apps that use the > syscall system call to make system calls. We don't handle apps > that directly use "int 0x80". Again, there are ways to do this, but > our user base hasn't asked for it. > In general, please keep in mind the principles that you rightly had > to remind me of in a previous post. :-) Our two pieces of work are coming > from two different directions with two different visions. 
Linux C/R wants > to be so transparent that no user app can ever detect it. DMTCP wants to be > transparent enough that any reasonable use case is covered. Agreed - as long as we are considering the c/r-engine functionality (and not the "glue" logic to keep apps outside their context after the restart). That said, I'm afraid we'll have more definitions of what is "reasonable" than of what is "transparent"... > In particular, DMTCP considers distributed computations to be equally > valid use cases for the core DMTCP C/R. I also agree that Linux C/R can be > extended to cover distributed apps -- either through userland extensions, > or maybe with techniques like in your excellent CLUSTER-2005 paper. Distributed c/r is one of the proposed use-cases for linux-cr. The technique in that paper, BTW, was a userspace glue: during restart, that glue re-establishes connectivity by using new TCP connections, and c/r uses those new sockets in lieu of restoring the old ones. For that and other use-cases we designed linux-cr to be flexible so that it is possible and easy to integrate any userspace glue. >> Also, it conflicts with LD_PRELOAD possibly needed >> for other software (like valgrind) - for which again you would need >> yet another per-app wrapper, at the very least. > > DMTCP does not conflict with the fact that valgrind uses LD_PRELOAD. > We add dmtcphijack.so to the beginning of LD_PRELOAD before the user app > starts. We then remove it before the app really starts. The LD_PRELOAD > requests of valgrind continue to be honored. It all works. I stand corrected. >>> what you mean by the _collaboration of the application_. DMTCP operates >>> on unmodified application binaries. >> >> I mean that the applications needs to be scheduled and to run to >> participate in its own checkpoint. You use syscall interposition >> and signals games to do exactly that - gain control over the app >> and run your library's code.
This has at least three negatives: >> first, some apps don't want to or can't run - e.g. ptraced, or >> swapped (think incremental checkpoint: why swap everything in ?!); >> Second, the coordination can take significant time, especially if >> many tasks/threads and resources are involved; Third, it modifies >> the state of the app - if something goes wrong while you use c/r >> to migrate an app, you impact the app. [snip] > If it helps, then think of a wrapper as just another function, > that calls an inner function. Object-oriented programming uses this > principle all the time. Similarly, the glibc wrapper around a kernel > API is just one more of these functions. Another way to view this is > through the idea of layers. Each layer of the software receives a call > from the layer above and may call to the next layer below. As you're > already aware, this is a basic principle of O/S design, and so > the O/S is full of wrappers. We're just inserting one more layer --- > this time between the user app and the glibc layer. Wrappers are great (I did TA the w4118 class here...). They are a powerful tool; however, in _our_ context they have downsides:
(a) wrappers add visible overhead (less so for cpu-bound apps, more so with server apps)
(b) wrappers that do virtualization to a "black-box" API (as opposed to integrating with the API) are prone to races (see the paper that I cited before)
(c) wrappers duplicate kernel logic, IMHO unnecessarily (and I don't refer to the userspace "glue" from above)
(d) wrappers are hard to make hermetic (no escapes) to apps.
IMO, the one excellent reason to use wrappers is to support the userspace glue that allows restarted apps to run out of their original context. > > I still don't fully understand what you mean by "collaboration", but > it sounds like your definition reduces to the the use of system call > wrappers. In that case, I agree that if DMTCP were not allowed to use I clearly failed to explain well.
Lemme try again: If you use PTRACE to checkpoint, then you ptrace the target tasks, peek at and save their state, and then let them resume execution. The target apps need not collaborate - they are forced by the kernel to the ptraced state regardless of what they were doing, and resume execution without knowing what happened. In linux-cr it works similarly: checkpoint does not require that the processes be scheduled to run - they don't participate; rather, external process(es) do the work. In contrast, IIUC, dmtcp uses syscall wrappers and overloading of signal(s) in order to make every checkpointed process/thread actively execute the checkpoint logic. I refer to this as "collaborating" with the checkpoint operation. (I mentioned the downside of this requirement above). > system call wrappers, then DMTCP would fall apart. Aside from that > almost tautology, I don't understand why system call wrappers are inherently > bad. Glibc puts system call wrappers around almost every kernel system call. > Glibc even reserves two signals solely for its own use. Again, I failed to deliver the message: syscall wrappers are not bad. They have limitations as noted above. Some users won't care, others may and do. As for glibc - those wrappers have a set of well defined tasks, e.g. set errno, hide underlying syscall, caching, threads etc. But glibc does not try to virtualize pids, for example, nor "spy" after the processes, so to speak. >>> Basically, if _transparent_ means >>> that one is not allowed to use anything at all from userland, then I >>> agree with you that no userland checkpointing can ever be transparent. >>> But, I think that's a biased definition of _transparent_. :-) >> >> "Transparent" c/r means "invisible" to the user/apps, i.e. that >> you don't restrict the user or the app in what they do and how >> they do it. >> >> Did you ever try to 'ltrace skype' ? there exists useful and >> popular software that doesn't like being spied after... 
> > We have not tried to 'ltrace skype'. But ltrace is using PTRACE. > Note that DMTCP does not use PTRACE. I imagine the more interesting question Oh... that's not what I meant: 'ltrace skype' fails because skype tries to protect itself from being reverse-engineered. It doesn't like ltrace's interposition on some library calls (don't know the details). (Note that PTRACE doesn't upset skype: 'strace skype' does work). The point being - userspace wrapping is "escapable". > is if we ever tried 'dmtcp_checkpoint skype'. No, we have not, but > it sounds like an interesting experiment. We'd love to do it, and > discuss with you whatever we learn. In the offline discussion, perhaps > we can take a shortcut and have you describe the skype tricks to us, > so that we can give you a quick first guess. No tricks - I once tried after a colleague mentioned that skype is hard to reverse engineer (I thought I could prove him wrong...). > Anyway, there's one other obvious issue with skype for both Linux C/R > and DMTCP. Skype is talking to a remote app that is probably not under > checkpoint control. Linux-cr can do live migration - e.g. VDI, move the desktop - in which case skype's sockets' network stacks are reconstructed, transparently to both skype (local apps) and the peer (remote apps). Then, at the destination host and skype continues to work. > And even if both ends are under checkpoint control, > Skype is probably not a good use case for C/R, but if it were, it might > indeed be a difficult problem. (I'd have to think about it.) > As before, remember that we are talking about two different approaches: > - in-kernel C/R and capturing every possible application; > - userland C/R and covering the actual use cases that one finds in practice I'd assume that if the c/r engine can do the former, then it will also do the latter. Maybe even it would be useful for dmtcp to be able to use a couple of syscalls (checkpoint,restart) to do the base c/r work :p Oren. 
^ permalink raw reply [flat|nested] 111+ messages in thread
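The point in the message above that userspace wrapping is "escapable" can be pictured with a toy Python analogy. It is only an analogy under stated assumptions: `LibraryLayer` stands in for the library layer that an interposer wraps, `interpose` for the interposer, and the pid value 4242 is arbitrary. None of these names come from DMTCP, ltrace, or linux-cr, and the real mechanisms (LD_PRELOAD, raw syscall instructions) are not modelled.

```python
import os


class LibraryLayer:
    """Stand-in for the library layer (think libc) that most apps call into."""

    def getpid(self):
        return os.getpid()


def interpose(layer, virtual_pid):
    """Install a wrapper on the library layer that reports a virtual pid."""
    real_getpid = layer.getpid

    def wrapped_getpid():
        real_getpid()        # the wrapper can still call the inner function
        return virtual_pid   # but it answers with the virtualized value

    layer.getpid = wrapped_getpid
    return layer


def polite_app(layer):
    """An app that goes through the library layer sees the wrapper's answer."""
    return layer.getpid()


def escaping_app(layer):
    """An app that skips the library layer and asks the lower layer directly.

    This plays the role of a program issuing the system call itself: the
    wrapper installed on `layer` is never consulted.
    """
    return os.getpid()


layer = interpose(LibraryLayer(), virtual_pid=4242)
seen_by_polite_app = polite_app(layer)
seen_by_escaping_app = escaping_app(layer)
```

The two apps can end up with different views of the same fact, which is the kind of inconsistency the "hermetic" concern in the thread is about.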
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-08 3:55 ` Oren Laadan @ 2010-11-08 16:26 ` Gene Cooperman 2010-11-08 18:14 ` Oren Laadan 2010-11-08 19:05 ` Dan Smith 0 siblings, 2 replies; 111+ messages in thread From: Gene Cooperman @ 2010-11-08 16:26 UTC (permalink / raw) To: Oren Laadan Cc: Gene Cooperman, Kapil Arya, Tejun Heo, ksummit-2010-discuss, linux-kernel, hch As before, Oren, let's have that phone discussion so that we can preprocess a lot of this, instead of acting like the three blind men and the elephant. I will _tell you_ the strengths and weaknesses of DMTCP on the phone, instead of you having to guess at them here on LKML. And of course, I hope you will be similarly frank about Linux C/R on the phone. Thank you for lowering the heat on this last post. I'll reply only to some relevant issues in this post, rather than trying to respond to all of your posts. I remind you that I still have my own questions about Linux C/R, but I'm saving them for the phone discussion, since that will be more efficient, and result in less heat. > > If it helps, then think of a wrapper as just another function, > >that calls an inner function. Object-oriented programming uses this > >principle all the time. Similarly, the glibc wrapper around a kernel > >API is just one more of these functions. Another way to view this is > >through the idea of layers. Each layer of the software receives a call > >from the layer above and may call to the next layer below. As you're > >already aware, this is a basic principle of O/S design, and so > >the O/S is full of wrappers. We're just inserting one more layer --- > >this time between the user app and the glibc layer. > > Wrappers are great (I did TA the w4118 class here...).
They are > a powerful tool; however in _our_ context they have downsides: > (a) wrappers add visible overhead (less so for cpu-bound apps, > more so with server apps) In our experience, the primary overhead of C/R is to save the data to disk. This far outweighs the question of how many ms one technique or another may require in a system call or in the kernel. > (b) wrappers that do virtualization to a "black-box" API (as > opposed to integrate with the API) are prone to races (see the > paper that I cited before) The paper you cited was: http://www.stanford.edu/~talg/papers/traps/abstract.html Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools That paper is about sandboxing. DMTCP is about C/R. If DMTCP were trying to do a sandbox, it might have some of the same traps and pitfalls. Luckily, userland C/R is a _lot_ easier than userland sandboxing. By the way, although of less importance, I'll point out that the paper was written in 2003, before DMTCP even started. Next, you talk about races. The authors of that paper have races because they are trying to do sandboxing. I already answered Matt's post earlier about why we don't see races in DMTCP. I'll answer it again, but in more detail. At ordinary run-time, the DMTCP checkpoint thread is just waiting on a select -- waiting for instructions from the DMTCP coordinator. Our system call wrappers around user threads do not change the issue of races. If two user threads used to have a race, they will continue to do so in DMTCP. If two user threads did not have a race, then DMTCP will not introduce any new races. How should DMTCP introduce a new race when DMTCP wrappers _never_ communicate with any other thread? At checkpoint or restart time, the DMTCP checkpoint thread also runs. However, at checkpoint time, the first thing it does is to quiesce all the user threads by sending a signal and forcing them into a DMTCP signal handler.
(And before we go down that other road again, I remind you that glibc also reserves two signals solely for the use of glibc. A user app can break glibc by using the glibc reserved signals.) During checkpoint-restart, the DMTCP checkpoint thread is then the _only_ thread that is executing. So, again, I don't see how a race could be introduced. Finally, the last thing the DMTCP checkpoint thread does is resume the user threads. The DMTCP checkpoint thread then goes back to waiting on select for a message from the DMTCP coordinator. > (c) wrappers duplicate kernel logic, IMHO unnecessarily (and I > don't refer to the userspace "glue" from above) DMTCP wrappers do not duplicate kernel logic. In our phone conversation, I will show you each and every one of the DMTCP wrappers. I've already posted for the entire list where they can find the DMTCP wrappers. I honestly don't see any duplication of kernel logic. If you do see this, please tell us which DMTCP wrapper is duplicating the kernel logic, so that we can talk about specifics. But please, can we review the DMTCP code offline? A code review within LKML seems _awfully_ tedious. :-) > (d) wrappers are hard to make hermetic (no escapes) to apps. In general, we don't try to make all the DMTCP wrappers hermetic. Your mindset may be influenced by the sandboxing paper above. But again, we're not doing sandboxing. We're doing C/R. If you're using "hermetic" as a placeholder for what we call "pid virtualization" (a translation table between original and current pid), then yes: for every system call that takes a pid as an argument or returns a pid, we must add a wrapper. That is not a difficult task. Let's do a code review of DMTCP together (on the phone) to look for a "leak" in the DMTCP pid's. I do think this is a lot easier and less complex to do than to guard against all resource leaks in a container. :-) (Sorry, I know that's a cheap shot on my part. 
I'm getting tired of overly broad statements, without the opportunity for us to do a code review or preprocess the issues back and forth on the phone.) > IMO, the one excellent reasons to use wrappers is to support > the userspace glue that allows restarted apps to run out of > their original context. > > > > >I still don't fully understand what you mean by "collaboration", but > >it sounds like your definition reduces to the the use of system call > >wrappers. In that case, I agree that if DMTCP were not allowed to use > > I clearly failed to explain well. Lemme try again: > > If you use PTRACE to checkpoint, then you ptrace the target tasks, > peek at and save their state, and then let them resume execution. > The target apps need not collaborate - they are forced by the kernel > to the ptraced state regardless of what they were doing, and resume > execution without knowing what happened. > > In linux-cr it works similarly: checkpoint does not require that > the processes be scheduled to run - they don't participate; rather, > external process(es) do the work. > > In contrast, IIUC, dmtcp uses syscall wrappers and overloading of > signal(s) in order to make every checkpointed process/thread actively > execute the checkpoint logic. I refer to this as "collaborating" > with the checkpoint operation. (I mentioned the downside of this > requirement above). Again, a correction. DMTCP does _not_ overload signals. It uses a signal not already used by the app. If the app tries to "zero out" all signals, then DMTCP protects itself through wrappers (or what you would call "lying", although I dislike these emotionally loaded phrases). Glibc also uses dedicated signals. Concerning "collaboration", when gdb inserts a breakpoint, it modifies the user code. So, even though gdb uses PTRACE, by your definition, the gdb use of breakpoints relies on "collaboration". > >system call wrappers, then DMTCP would fall apart. 
Aside from that > >almost tautology, I don't understand why system call wrappers are inherently > >bad. Glibc puts system call wrappers around almost every kernel system call. > >Glibc even reserves two signals solely for its own use. > > Again, I failed to deliver the message: syscall wrappers are not bad. > They have limitations as noted above. Some users won't care, others > may and do. > > As for glibc - those wrappers have a set of well defined tasks, > e.g. set errno, hide underlying syscall, caching, threads etc. But > glibc does not try to virtualize pids, for example, nor "spy" after > the processes, so to speak. I'm sorry to be blunt, but I simply have to say that you are wrong here. We've spent six years developing DMTCP. We've spent a lot of time getting to know the design principles of glibc. (And by the way, it's not just glibc that does these dirty tricks with system calls --- bash, dash, Matlab, and a host of other applications also do it.) Anyway, glibc definitely does have its own "dirty tricks", including "spy"-ing. Caching a pid and refusing to make a later system call is definitely a form of spying. It gets worse with glibc session ids and group ids. When a session id or group id changes, glibc must inform all of the user threads that their cached id has changed. To do this it uses the SETXID concept and a dedicated signal, as I mentioned earlier. At the time when the clone call was created, there was a discussion about whether to implement threads directly in the Linux kernel. It was decided to go with the clone call, instead. If I understand your general philosophy, that was a bad decision, because NPTL threads are no longer transparent, and they now require collaboration through wrappers in glibc. (Sorry, another cheap shot. Can we please shift the discussion to a phone conversation? If you're going to make me spend hours replying on LKML, when I could explain it all to you in one hour on the phone, then I will get cranky.)
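The "pid virtualization" described earlier in this message (a translation table between original and current pid, consulted by a wrapper around every call that takes or returns a pid) can be sketched in a few lines of Python. This is a hedged illustration of the idea only; it is not taken from DMTCP's pidwrappers.cpp or virtualpidtable.cpp, and the class name, function names, and pid values are all hypothetical.

```python
class VirtualPidTable:
    """Maps the pid an app saw before checkpoint to its current real pid."""

    def __init__(self):
        self._virt_to_real = {}
        self._real_to_virt = {}

    def register(self, virtual_pid, real_pid):
        """Record (or update, e.g. after restart) one virtual/real pairing."""
        old_real = self._virt_to_real.pop(virtual_pid, None)
        if old_real is not None:
            self._real_to_virt.pop(old_real, None)
        old_virt = self._real_to_virt.pop(real_pid, None)
        if old_virt is not None:
            self._virt_to_real.pop(old_virt, None)
        self._virt_to_real[virtual_pid] = real_pid
        self._real_to_virt[real_pid] = virtual_pid

    def to_real(self, virtual_pid):
        # Pids the table does not know about pass through unchanged.
        return self._virt_to_real.get(virtual_pid, virtual_pid)

    def to_virtual(self, real_pid):
        return self._real_to_virt.get(real_pid, real_pid)


def make_kill_wrapper(table, real_kill):
    """Wrapper for a call that TAKES a pid: translate the argument going in."""
    def kill(pid, sig):
        return real_kill(table.to_real(pid), sig)
    return kill


def make_getpid_wrapper(table, real_getpid):
    """Wrapper for a call that RETURNS a pid: translate the result coming out."""
    def getpid():
        return table.to_virtual(real_getpid())
    return getpid


# Assumed scenario: the app ran as pid 100, and after restart it is pid 7345.
table = VirtualPidTable()
table.register(100, 100)
table.register(100, 7345)

delivered = []  # records what the "real" kill would have received
kill = make_kill_wrapper(table, lambda pid, sig: delivered.append((pid, sig)))
getpid = make_getpid_wrapper(table, lambda: 7345)
kill(100, 15)
```

The inner `real_kill` and `real_getpid` callables are passed in so the sketch stays self-contained; in a real interposition library they would be the next layer down.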
There are also other "dirty tricks" from glibc that I could bring out for you -- where one might argue that glibc breaks your definition of transparency. (However, the literature has lots of papers on "transparent checkpointing", and I think they use a different definition of transparency from yours.) With DMTCP and glibc both, the philosophy is that as long as the application coverage is broad enough, and as long as the tricks of DMTCP and glibc do not affect any programmer's natural programming methodology, then it's okay. This is not about sandboxing, or hermeticity. I understand that Linux C/R may have those higher goals, and that's laudable, but please don't tell us that DMTCP is bad because it doesn't do exactly what Linux C/R does. (Sorry, getting cranky, again.) > >>>Basically, if _transparent_ means > >>>that one is not allowed to use anything at all from userland, then I > >>>agree with you that no userland checkpointing can ever be transparent. > >>>But, I think that's a biased definition of _transparent_. :-) > >> > >>"Transparent" c/r means "invisible" to the user/apps, i.e. that > >>you don't restrict the user or the app in what they do and how > >>they do it. > >> > >>Did you ever try to 'ltrace skype' ? there exists useful and > >>popular software that doesn't like being spied after... > > > >We have not tried to 'ltrace skype'. But ltrace is using PTRACE. > >Note that DMTCP does not use PTRACE. I imagine the more interesting question > > Oh... that's not what I meant: 'ltrace skype' fails because skype > tries to protect itself from being reverse-engineered. It doesn't > like ltrace's interposition on some library calls (don't know the > details). (Note that PTRACE doesn't upset skype: 'strace skype' > does work). The point being - userspace wrapping is "escapable". > > >is if we ever tried 'dmtcp_checkpoint skype'. No, we have not, but > >it sounds like an interesting experiment. We'd love to do it, and > >discuss with you whatever we learn. 
In the offline discussion, perhaps > >we can take a shortcut and have you describe the skype tricks to us, > >so that we can give you a quick first guess. > > No tricks - I once tried after a colleague mentioned that skype is > hard to reverse engineer (I thought I could prove him wrong...). > > > Anyway, there's one other obvious issue with skype for both Linux C/R > >and DMTCP. Skype is talking to a remote app that is probably not under > >checkpoint control. > > Linux-cr can do live migration - e.g. VDI, move the desktop - in > which case skype's sockets' network stacks are reconstructed, > transparently to both skype (local apps) and the peer (remote apps). > Then, at the destination host and skype continues to work. That's a really cool thing to do, and it's definitely not part of what DMTCP does. It might be possible to do userland live migration, but it's definitely not part of our current scope. But if we're talking about live migration, have you also looked at the work of Andres Lagar Caviilla on SnowFlock? http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf He does live migration of entire virtual machines, again with very small delay. Of course, the issue for any type of live migration is that if the rate of dirtying pages is very high (e.g. HPC), then there is still a delay or slow response, due to page faults to a remote host. > >And even if both ends are under checkpoint control, > >Skype is probably not a good use case for C/R, but if it were, it might > >indeed be a difficult problem. (I'd have to think about it.) > > As before, remember that we are talking about two different approaches: > >- in-kernel C/R and capturing every possible application; > >- userland C/R and covering the actual use cases that one finds in practice > > I'd assume that if the c/r engine can do the former, then it > will also do the latter. 
Maybe even it would be useful for dmtcp > to be able to use a couple of syscalls (checkpoint,restart) to > do the base c/r work :p Yes, we have no objection to combining ideas from DMTCP and Linux C/R. This is not a case of either-or. Let's study the use cases together. I won't say more, because I'm clearly getting cranky right now. :-) > Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-08 16:26 ` Gene Cooperman @ 2010-11-08 18:14 ` Oren Laadan 2010-11-08 18:37 ` Gene Cooperman 2010-11-08 19:05 ` Dan Smith 1 sibling, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-08 18:14 UTC (permalink / raw) To: Gene Cooperman Cc: Kapil Arya, Tejun Heo, ksummit-2010-discuss, linux-kernel, hch, Linux Containers Hi, Ok, I'll bite the bullet for now - to be continued... Just one important clarification: >> Linux-cr can do live migration - e.g. VDI, move the desktop - in >> which case skype's sockets' network stacks are reconstructed, >> transparently to both skype (local apps) and the peer (remote apps). >> Then, at the destination host and skype continues to work. > > That's a really cool thing to do, and it's definitely not part of what > DMTCP does. It might be possible to do userland live migration, > but it's definitely not part of our current scope. But if we're talking > about live migration, have you also looked at the work of > Andres Lagar Caviilla on SnowFlock? > http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf > He does live migration of entire virtual machines, again with very > small delay. Of course, the issue for any type of live migration is that > if the rate of dirtying pages is very high (e.g. HPC), then there is > still a delay or slow response, due to page faults to a remote host. VMware, Xen and KVM already do live migration. However, VMs are a separate beast. We are concerned about _application_ level c/r and migration (complete containers or individual applications). Many proven techniques from the VM world apply to our context too (in your example, post-copy migration). Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-08 18:14 ` Oren Laadan @ 2010-11-08 18:37 ` Gene Cooperman 2010-11-08 19:34 ` Oren Laadan 0 siblings, 1 reply; 111+ messages in thread From: Gene Cooperman @ 2010-11-08 18:37 UTC (permalink / raw) To: Oren Laadan Cc: Gene Cooperman, Kapil Arya, Tejun Heo, ksummit-2010-discuss, linux-kernel, hch, Linux Containers Thanks for the careful response, Oren. For others who read this, one could interpret Oren's rapid post as criticizing the work of Andres Lagar Cavilla. I'm sure that this was not Oren's intention. Please read below for a brief clarification of the novelty of SnowFlock. Anyway, I really look forward to the phone discussion. I've also enjoyed our interchange, for giving me an opportunity to explain more about the DMTCP design. Thank you. Best wishes, - Gene On Mon, Nov 08, 2010 at 01:14:12PM -0500, Oren Laadan wrote: > Hi, > > Ok, I'll bite the bullet for now - to be continued... > > Just one important clarification: > > >>Linux-cr can do live migration - e.g. VDI, move the desktop - in > >>which case skype's sockets' network stacks are reconstructed, > >>transparently to both skype (local apps) and the peer (remote apps). > >>Then, at the destination host and skype continues to work. > > > >That's a really cool thing to do, and it's definitely not part of what > >DMTCP does. It might be possible to do userland live migration, > >but it's definitely not part of our current scope. But if we're talking > >about live migration, have you also looked at the work of > >Andres Lagar Caviilla on SnowFlock? > > http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf > >He does live migration of entire virtual machines, again with very > >small delay. Of course, the issue for any type of live migration is that > >if the rate of dirtying pages is very high (e.g. HPC), then there is > >still a delay or slow response, due to page faults to a remote host. 
> > VMware, Xen and KVM already do live migration. However, VMs > are a separate beast. I absolutely agree with your point that live migration of applications is a different beast, and technically very novel. Since I know Andres Lagar Cavilla personally, I also feel obligated to comment on why SnowFlock truly is novel in the VM space. First, as Andres writes: "SnowFlock is an open-source project [SnowFlock] built on the Xen 3.0.3 VMM [Barham 2003]." In the abstract, Andres points out one of the major points of novelty: "To evaluate SnowFlock, we focus on the demanding scenario of services requiring on-the-fly creation of hundreds of parallel workers in order to solve computationally-intensive queries in seconds." We must be careful that we don't destroy someone's reputation without a careful study of their work. > We are concerned about _application_ level c/r and migration > (complete containers or individual applications). Many proven > techniques from the VM world apply to our context too (in your > example, post-copy migration). > > Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-08 18:37 ` Gene Cooperman @ 2010-11-08 19:34 ` Oren Laadan 0 siblings, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-08 19:34 UTC (permalink / raw) To: Gene Cooperman; +Cc: Kapil Arya, Tejun Heo, linux-kernel, hch, Linux Containers On 11/08/2010 01:37 PM, Gene Cooperman wrote: > Thanks for the careful response, Oren. For others who read this, > one could interpret Oren's rapid post as criticizing the work of > Andres Lagar Cavilla. I'm sure that this was not Oren's intention. > Please read below for a brief clarification of the novelty of SnowFlock. Err... yes, that was careless of me. I was too focused on getting the thread back to track. Thanks for pointing out. >>> about live migration, have you also looked at the work of >>> Andres Lagar Caviilla on SnowFlock? >>> http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf >>> He does live migration of entire virtual machines, again with very >>> small delay. Of course, the issue for any type of live migration is that >>> if the rate of dirtying pages is very high (e.g. HPC), then there is >>> still a delay or slow response, due to page faults to a remote host. >> >> VMware, Xen and KVM already do live migration. However, VMs >> are a separate beast. > > I absolutely agree with your point that live migration of > applications is a different beast, and technically very novel. > Since I know Andres Lagar Cavilla personally, I also feel obligated > to comment why SnowFlock truly is novel in the VM space. First, as Andres > writes: > "SnowFlock is an open-source project [SnowFlock] built on the Xen 3.0.3 > VMM [Barham 2003]." > In the abstract, Andres points out one of the major points of novelty: > "To evaluate SnowFlock, we focus on the demanding > scenario of services requiring on-the-fly creation of hundreds > of parallel workers in order to solve computationallyintensive > queries in seconds." 
> We must be careful that we don't destroy someone's reputation without > a careful study of their work. Yes, it's really nice work - I saw it when I visited there. (Coincidentally the post-copy idea with Xen appeared also in VEE 09 briefly before). Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-08 16:26 ` Gene Cooperman 2010-11-08 18:14 ` Oren Laadan @ 2010-11-08 19:05 ` Dan Smith 2010-11-17 11:14 ` Tejun Heo 1 sibling, 1 reply; 111+ messages in thread From: Dan Smith @ 2010-11-08 19:05 UTC (permalink / raw) To: Gene Cooperman Cc: Oren Laadan, Kapil Arya, Tejun Heo, ksummit-2010-discuss, linux-kernel, hch GC> As before, Oren, let's have that phone discussion so that we can GC> preprocess a lot of this, instead of acting like the the three GC> blind men and the elephant. I will _tell you_ the strengths and GC> weaknesses of DMTCP on the phone, instead of you having to guess GC> at them here on LKML. And of course, I hope you will be similarly GC> frank about Linux C/R on the phone. I want to be in on that discussion too, as do a lot of other people here. However, I doubt we'll all be able to find a common spot on our collective schedules, nor would that conversation be archived for posterity. I think sticking to LKML is the right (and time-tested) approach. OL> Linux-cr can do live migration - e.g. VDI, move the desktop - in OL> which case skype's sockets' network stacks are reconstructed, OL> transparently to both skype (local apps) and the peer (remote OL> apps). Then, at the destination host and skype continues to work. GC> That's a really cool thing to do, and it's definitely not part of GC> what DMTCP does. It might be possible to do userland live GC> migration, but it's definitely not part of our current scope. How would you go about doing that in userland? With the current linux-cr implementation, I can move something like sshd or sendmail from one machine to another without a remote (connected) client noticing anything more than a bit of delay during the move. 
I think that saving and restoring the state of a TCP connection from userland is probably a good example of a case where it makes sense to have it as part of a C/R function, but not necessarily exposed in /sys or /proc somewhere. Unless it can be argued that doing so is not useful, I think that's a good talking point for discussing the kernel vs. user approach, no? -- Dan Smith IBM Linux Technology Center ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-08 19:05 ` Dan Smith @ 2010-11-17 11:14 ` Tejun Heo 2010-11-17 15:33 ` Dan Smith 0 siblings, 1 reply; 111+ messages in thread From: Tejun Heo @ 2010-11-17 11:14 UTC (permalink / raw) To: Dan Smith Cc: Gene Cooperman, Oren Laadan, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch Hello, On 11/08/2010 08:05 PM, Dan Smith wrote: > GC> As before, Oren, let's have that phone discussion so that we can > GC> preprocess a lot of this, instead of acting like the the three > GC> blind men and the elephant. I will _tell you_ the strengths and > GC> weaknesses of DMTCP on the phone, instead of you having to guess > GC> at them here on LKML. And of course, I hope you will be similarly > GC> frank about Linux C/R on the phone. > > I want to be in on that discussion too, as do a lot of other people > here. However, I doubt we'll all be able to find a common spot on our > collective schedules, nor would that conversation be archived for > posterity. I think sticking to LKML is the right (and time-tested) > approach. Amen. > OL> Linux-cr can do live migration - e.g. VDI, move the desktop - in > OL> which case skype's sockets' network stacks are reconstructed, > OL> transparently to both skype (local apps) and the peer (remote > OL> apps). Then, at the destination host and skype continues to work. > > GC> That's a really cool thing to do, and it's definitely not part of > GC> what DMTCP does. It might be possible to do userland live > GC> migration, but it's definitely not part of our current scope. > > How would you go about doing that in userland? With the current > linux-cr implementation, I can move something like sshd or sendmail > from one machine to another without a remote (connected) client > noticing anything more than a bit of delay during the move. 
> > I think that saving and restoring the state of a TCP connection from > userland is probably a good example of a case where it makes sense to > have it as part of a C/R function, but not necessarily exposed in /sys > or /proc somewhere. Unless it can be argued that doing so is not > useful, I think that's a good talking point for discussing the kernel > vs. user approach, no? Meh, just implementing a conntrack module should be good enough for most use cases. If it ever becomes a general enough problem (which I extremely strongly doubt), we can think about allowing processes in a netns to change sequence number but that would be a single setsockopt option instead of the horror show of dumping in-kernel data structures in binary blob. Thanks. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 11:14 ` Tejun Heo @ 2010-11-17 15:33 ` Dan Smith 2010-11-17 15:40 ` Tejun Heo 0 siblings, 1 reply; 111+ messages in thread From: Dan Smith @ 2010-11-17 15:33 UTC (permalink / raw) To: Tejun Heo Cc: Gene Cooperman, Oren Laadan, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch TH> If it ever becomes a general enough problem (which I extremely TH> strongly doubt), Migration of a container? Yeah, it's one of the primary reasons for doing what we're doing :) TH> we can think about allowing processes in a netns to change TH> sequence number but that would be a single setsockopt option Yeah, well there's more than that, of course, if you want to be able to checkpoint a socket in any state. Buffers, time-wait, etc. TH> instead of the horror show of dumping in-kernel data structures in TH> binary blob. Well, as should be evident from a review of the code, we don't dump binary kernel data structures as a general rule. We canonicalize them into checkpoint headers on the way out and build the new data structures (or use existing kernel interfaces to do so) on the way in. You know, just like netlink does. It has even been suggested that we do this with netlink instead, to mirror the other "horror show" tools that we all use on a daily basis. We're not opposed to this, but we do have some concerns about performance. -- Dan Smith IBM Linux Technology Center email: danms@us.ibm.com ^ permalink raw reply [flat|nested] 111+ messages in thread
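The distinction drawn above, between dumping raw in-memory structures and canonicalizing them into records with a defined layout, can be shown with a toy example. Everything here is hypothetical and is not the linux-cr image format: the record type number, the field selection and the names `TcpState`, `dump` and `load` are made up for illustration. The point is only that the saved bytes follow an explicit, fixed-width layout and that restore rebuilds a fresh object from them.

```python
import struct
from dataclasses import dataclass

# Hypothetical record layout: type (u32), payload length (u32), then payload.
HEADER = struct.Struct("<II")
# Hypothetical payload for a TCP socket: send seq, receive seq, state (all u32).
TCP_PAYLOAD = struct.Struct("<III")
REC_TCP_SOCKET = 1  # made-up record type number


@dataclass
class TcpState:
    """Stand-in for in-memory state; a real kernel object has far more fields."""
    snd_seq: int
    rcv_seq: int
    state: int


def dump(sock):
    """Serialize selected fields into a fixed little-endian record."""
    payload = TCP_PAYLOAD.pack(sock.snd_seq, sock.rcv_seq, sock.state)
    return HEADER.pack(REC_TCP_SOCKET, len(payload)) + payload


def load(blob):
    """Validate the record header, then build a new object from the payload."""
    rec_type, length = HEADER.unpack_from(blob, 0)
    if rec_type != REC_TCP_SOCKET or length != TCP_PAYLOAD.size:
        raise ValueError("unexpected record type or length")
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated record")
    return TcpState(*TCP_PAYLOAD.unpack(payload))
```

Because the layout is spelled out rather than copied from memory, the reader can reject records it does not understand, and the in-memory representation is free to change without invalidating saved images.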
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 15:33 ` Dan Smith @ 2010-11-17 15:40 ` Tejun Heo 2010-11-17 17:04 ` Alexey Dobriyan 0 siblings, 1 reply; 111+ messages in thread From: Tejun Heo @ 2010-11-17 15:40 UTC (permalink / raw) To: Dan Smith Cc: Gene Cooperman, Oren Laadan, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch Hello, On 11/17/2010 04:33 PM, Dan Smith wrote: > TH> If it ever becomes a general enough problem (which I extremely > TH> strongly doubt), > > Migration of a container? Yeah, it's one of the primary reasons for > doing what we're doing :) Well, then push for the feature. If the rationale is strong enough, it'll get in. > TH> we can think about allowing processes in a netns to change > TH> sequence number but that would be a single setsockopt option > > Yeah, well there's more than that, of course, if you want to be able > to checkpoint a socket in any state. Buffers, time-wait, etc. I haven't really thought about it too deeply but for all other misc states, you should be able to emulate it by talking to a netfilter module. The reason I suggested a sequence-number-changing setsockopt option is that it is the only performance-sensitive part, and with it you should be able to resume live sockets without conntracking. For cold paths, using a netfilter module during resume should do, right? > TH> instead of the horror show of dumping in-kernel data structures in > TH> binary blob. > > Well, as should be evident from a review of the code, we don't dump > binary kernel data structures as a general rule. We canonicalize them > into checkpoint headers on the way out and build the new data > structures (or use existing kernel interfaces to do so) on the way in. > You know, just like netlink does. netlink interaction is defined by ABI. > It has even been suggested that we do this with netlink instead, to > mirror the other "horror show" tools that we all use on a daily basis.
> We're not opposed to this, but we do have some concerns about > performance. The horror show part is dumping internal data structure without due scrutinization in a way which can only ever be useful for CR when most of the same states are already exported via ABI defined ways. Thanks. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 15:40 ` Tejun Heo @ 2010-11-17 17:04 ` Alexey Dobriyan 0 siblings, 0 replies; 111+ messages in thread From: Alexey Dobriyan @ 2010-11-17 17:04 UTC (permalink / raw) To: Tejun Heo Cc: Dan Smith, Gene Cooperman, Oren Laadan, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch On Wed, Nov 17, 2010 at 5:40 PM, Tejun Heo <tj@kernel.org> wrote: > The horror show part is dumping internal data structure without due > scrutinization in a way which can only ever be useful for CR when most > of the same states are already exported via ABI defined ways. That's what review process is for, isn't it? Please, look at what is being dumped and what isn't. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-06 0:36 ` Kapil Arya 2010-11-06 22:55 ` Oren Laadan @ 2010-11-17 10:45 ` Tejun Heo 2010-11-17 12:12 ` Tejun Heo 1 sibling, 1 reply; 111+ messages in thread From: Tejun Heo @ 2010-11-17 10:45 UTC (permalink / raw) To: Kapil Arya Cc: Gene Cooperman, Oren Laadan, ksummit-2010-discuss, linux-kernel, hch Hello, sorry about the long delay. Was lost in something else. On 11/06/2010 01:36 AM, Kapil Arya wrote: >> I'm probably missing something but can't you stop the application >> using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry >> about -EINTR failures (there are some exceptions but nothing really to >> worry about). Also, unless the manager thread needs to be always >> online, you can inject manager thread by manipulating the target >> process states while taking a snapshot. > > In fact CryoPid uses exactly the same approach and has been around > for around 5 years. Not as much development effort has gone into > CryoPid as DMTCP and so its application coverage is not as > broad. But the larger issue for using PTRACE is that you can not > have two superiors tracing the same inferior process. So if you want > to checkpoint a gdb session or valgrind or tmux or strace, then you > can not directly control and quiesce the inferior process being > traced. I've been thinking about this. We can easily introduce a new ptrace call which allows nesting. AFAICS, ptrace already exports most of the information necessary to restart the task - where it's stopped and why. The only missing thing seems to be the wait state (including for group stop) which can be added without too much difficulty. I'll try to write up an RFC patch. Things like that would be useful for other things too - say, you would be able to attach gdb to a strace'd process which would come in handy in some cases.
> Beyond that, we also have a vision (not yet implemented) of process > virtualization by which one can change the behavior of a > program. For example, if a distributed computation runs over > infiniband, can we migrate to a TCP/IP cluster. For this, one needs > the flexibility of wrappers around system calls. This vision of > process virtualization also motivates why our own research project > has steered away from in-kernel C/R. Yeah, definitely, for the higher level workarounds, there's no way around it but I think it would still be worthwhile to be able to provide a baseline implementation which can checkpoint and restart a single process in a reliable and well-defined way. >>> But since you ask :-), there is one thing on our wish list. We >>> handle address space randomization, vdso, vsyscall, and so on quite >>> well. We do not turn off address space randomization (although on >>> restart, we map user segments back to their original addresses). >>> Probably the randomized value of brk (end-of-data or end of heap) is >>> the thing that gave us the most troubles and that's where the code >>> is the most hairy. >> >> Can you please elaborate a bit? What do you want to see changed? > > Yes, we would love to elaborate :-). We began DMTCP with Linux > kernel 2.6.3. When Address Space Layout Randomization was added, we > were forced to add some hacks concerning VDSO location and > end-of-data. end-of-data is the uglier part. On restart, we > directly map each memory segment into the original address at > checkpoint time. The issue comes in mapping heap back to its > original location. We call sbrk() to reset the end-of-data to the > end of the original heap. This fails if the randomized > beginning-of-data/end-of-data given to us by the kernel for the > restarted process is too far away from where we want to remap the > heap. 
To get around this, we play games with legacy layout, other > personality parameters, and RLIMIT_STACK (since the kernel uses > RLIMIT_STACK in choosing the appropriate memory layout). > > For our wish list, we would like a way of telling the kernel, where > to set beginning-of-data/end-of-data. Curiously enough, at the time > at which Linux started randomizing address space, there was > discussion of offering exactly this facility for the sake of legacy > programs, but it turned out not to be needed. I see. Yeah, I completely forgot that kernel keeps track of brk. > Similarly, it would be nice to tell the kernel where we want the > VDSO page. Currently, we get around this by keeping two VDSO pages, > the old one which we restore and the new one specified to us by the > kernel when the restart process is created. This works well for, and > so controlling the address of the VDSO page is less important for > us. I haven't really looked at the VDSO generation but symbol offsets inside VDSO page can differ depending on kernel version, configuration, toolchains used, etc... right? You would need an extra layer of indirection no matter what in that case. > Since /proc/*/net provides a simpler design for sockets, we started > wondering what other simplifications may be possible. Here is one > possibility, in the case of shared file descriptors, DMTCP goes > through two barriers in order to decide which process will be > responsible for checkpointing which shared-file descriptor. It works > and the overhead is reasonable, but if you have additional > suggestion for this case, we would be very interested. I wrote in another mail but you can find out which fd's are shared by flipping O_NONBLOCK and looking at the flags field of /proc/*/fdinfo/*. Or are you talking about something else? > We really enjoyed this discussion. If you are interested, we would > be happy to talk further by phone in order to take advantage of the > higher bandwidth. 
As a few others have already pointed out, I think it's better to keep technical discussions on-line. Different people think at different paces and the schedules don't always match. Plus, other people can jump in and look up things later. It may take a bit more effort at the beginning but I think it gets easier in time. Thank you. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
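The O_NONBLOCK suggestion in the message above can be sketched as follows. This is a minimal illustration, not code from DMTCP or linux-cr: the helper names `fdinfo_flags` and `share_open_file` are made up, and the sketch compares two descriptors within the calling process, whereas a checkpointer would read `/proc/<pid>/fdinfo/<fd>` of other, stopped processes. It relies on the fact that file status flags such as O_NONBLOCK belong to the open file description, so a change made through one descriptor is visible through every descriptor sharing it.

```python
import fcntl
import os


def fdinfo_flags(fd, pid="self"):
    """Return the 'flags' field of /proc/<pid>/fdinfo/<fd> as an integer."""
    with open(f"/proc/{pid}/fdinfo/{fd}") as f:
        for line in f:
            if line.startswith("flags:"):
                return int(line.split()[1], 8)  # the field is printed in octal
    raise ValueError("no flags field found")


def share_open_file(fd_a, fd_b):
    """Guess whether two descriptors refer to the same open file description.

    Toggles O_NONBLOCK through fd_a, checks whether the change shows up in
    fd_b's fdinfo, then restores the original flags on fd_a.
    """
    original = fcntl.fcntl(fd_a, fcntl.F_GETFL)
    before = fdinfo_flags(fd_b) & os.O_NONBLOCK
    fcntl.fcntl(fd_a, fcntl.F_SETFL, original ^ os.O_NONBLOCK)
    try:
        after = fdinfo_flags(fd_b) & os.O_NONBLOCK
    finally:
        fcntl.fcntl(fd_a, fcntl.F_SETFL, original)
    return before != after
```

For example, a descriptor and its `os.dup()` copy would be reported as shared, while the two ends of a pipe would not. The probe is only meaningful while the processes involved are quiesced, since a running task could change the flag itself between the two reads.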
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 10:45 ` Tejun Heo @ 2010-11-17 12:12 ` Tejun Heo 0 siblings, 0 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-17 12:12 UTC (permalink / raw) To: Kapil Arya Cc: Gene Cooperman, Oren Laadan, ksummit-2010-discuss, linux-kernel, hch On 11/17/2010 11:45 AM, Tejun Heo wrote: >> Since /proc/*/net provides a simpler design for sockets, we started >> wondering what other simplifications may be possible. Here is one >> possibility, in the case of shared file descriptors, DMTCP goes >> through two barriers in order to decide which process will be >> responsible for checkpointing which shared-file descriptor. It works >> and the overhead is reasonable, but if you have additional >> suggestion for this case, we would be very interested. > > I wrote in another mail but you can find out which fd's are shared by > flipping O_NONBLOCK and looking at the flags field of > /proc/*/fdinfo/*. Or are you talking about something else? Ooh, one more thing, /proc/*/net/* has tx/rx queue counts. With those, you wouldn't need the cookie based connection draining, right? Thanks. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
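The tx/rx queue counts mentioned above appear as a hexadecimal `tx_queue:rx_queue` pair in each row of `/proc/<pid>/net/tcp`. The sketch below parses them from text in that format; the function names and the sample rows are made up for illustration, and treating "both counts are zero" as "drained" is a simplifying assumption (for listening sockets, for instance, the rx field has a different meaning).

```python
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0016 0100007F:C350 01 00000000:00000000 00:00000000 00000000  1000        0 4242 1 0000000000000000 20 4 30 10 -1
   1: 0100007F:C350 0100007F:0016 01 00000010:00000200 00:00000000 00000000  1000        0 4243 1 0000000000000000 20 4 30 10 -1
"""


def parse_tcp_queues(text):
    """Map socket inode -> (tx_queue, rx_queue) byte counts."""
    queues = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 10:
            continue
        tx_hex, rx_hex = fields[4].split(":")
        # Field 9 is the socket inode; entries without an owning socket
        # (inode 0) would collide here, which this sketch does not handle.
        queues[int(fields[9])] = (int(tx_hex, 16), int(rx_hex, 16))
    return queues


def is_drained(text, inode):
    """True only if the socket is listed and both queue counts are zero."""
    return parse_tcp_queues(text).get(inode) == (0, 0)
```

On a live system the text would come from reading `/proc/<pid>/net/tcp`, and the inode can be matched against the `socket:[inode]` link target under `/proc/<pid>/fd/`.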
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-05 9:28 ` Tejun Heo 2010-11-05 23:18 ` Oren Laadan 2010-11-06 0:36 ` Kapil Arya @ 2010-11-06 5:32 ` Matt Helsley 2010-11-06 15:01 ` Oren Laadan 2010-11-06 20:40 ` Gene Cooperman 2 siblings, 2 replies; 111+ messages in thread From: Matt Helsley @ 2010-11-06 5:32 UTC (permalink / raw) To: Tejun Heo Cc: Gene Cooperman, Kapil Arya, Oren Laadan, ksummit-2010-discuss, linux-kernel, hch On Fri, Nov 05, 2010 at 10:28:09AM +0100, Tejun Heo wrote: > Hello, > > On 11/04/2010 05:44 PM, Gene Cooperman wrote: > >>> In our personal view, a key difference between in-kernel and userland > >>> approaches is the issue of security. > >> > >> That's an interesting point but I don't think it's a dealbreaker. > >> ... but it's not like CR is gonna be deployed on > >> majority of desktops and servers (if so, let's talk about it then). > > > > This is a good point to clarify some issues. C/R has several good > > targets. For example, BLCR has targeted HPC batch facilities, and > > does it well. > > > > DMTCP started life on the desktop, and it's still a primary focus of > > DMTCP. We worked to support screen on this release precisely so > > that advanced desktop users have the option of putting their whole > > screen session under checkpoint control. It complements the core > > goal of screen: If you walk away from a terminal, you can get back > > the session elsewhere. If your session crashes, you can get back > > the session elsewhere (depending on where you save the checkpoint > > files, of course :-) ). > > Call me skeptical but I still don't see, yet, it being a mainstream > thing (for average sysadmin John and proverbial aunt Tilly). It > definitely is useful for many different use cases tho. Hey, but let's > see. Rightly so. It hasn't been widely proven as something that distros would be willing to integrate into a normal desktop session. We've got some demos of it working with VNC, twm, and vim. 
Oren has his own VNC, twm, etc. demos too. We haven't looked very closely at more advanced desktop sessions like (in no particular order) KDE or Gnome. Nor have we yet looked at working with any portions of X that were meant to provide this but were never popular enough to do so (XSMP iirc). Does DMTCP handle KDE/Gnome sessions? X too? On the kernel side of things for the desktop, right now we think our biggest obstacle is inotify. I've been working on kernel patches for kernel-cr to do that and it seems fairly do-able. Does DMTCP handle restarting inotify watches without dropping events that were present during checkpoint? The other problem for kernel c/r of X is likely to be DRM. Since the different graphics chipsets vary so widely there's nothing we can do to migrate DRM state of an NVIDIA chipset to DRM state of an ATI chipset as far as I know. Perhaps if that would help hybrid graphics systems then it's something that could be common between DRM and checkpoint/restart but it's very much pie-in-the-sky at the moment. Kernel c/r of input devices might be a lot easier. We just simulate hot [un]plug of the devices and rely on X responding. We can even checkpoint the events X would have missed and deliver them prior to hot unplug. Also, how does DMTCP handle unlinked files? They are important because lots of processes open a file in /tmp and then unlink it. And that's not even the most difficult case to deal with. How does DMTCP handle:
    link a to b
    open a (stays open)
    rm a
    <checkpoint and restart>
    open b
    write to b
    read from a (the write must appear)
?
> > > These are also some excellent points for discussion! The manager thread > > is visible. For example, if you run a gdb session under checkpoint > > control (only available in our unstable branch, currently), then > > the gdb session will indeed see the checkpoint manager thread. > > I don't think gdb seeing it is a big deal as long as it's hidden from > the application itself.
Is the checkpoint control process hidden from the application? What happens if it gets killed or dies in the middle of checkpoint? Can a malicious task being checkpointed (perhaps for later analysis) kill it? Or perhaps it runs as root or a user with special capabilities? > > > We try to hid the reserved signal (SIGUSR2 by default, but the user Mess. > > can configure it to anything else). We put wrappers around system > > calls that might see our signal handler, but I'm sure there are > > cases where we might not succeed --- and so a skilled user would > > have to configure to use a different signal handler. And of course, > > there is the rare application that repeatedly resets _every_ signal. > > We encountered this in an earlier version of Maple, and the Maple > > developers worked with us to open up a hole so that we could > > checkpoint Maple in future versions. > > > >> [while] all programs should be ready to handle -EINTR failure from system > >> calls, it's something which is very difficult to verify and test and > >> could lead to once-in-a-blue-moon head scratchy kind of failures. > > > > Exactly right! Excellent point. Perhaps this gets down to > > philosophy, and what is the nature of a bug. :-) In some cases, we > > have encountered this issue. Our solution was either to refuse to > > checkpoint within certain system calls, or to check the return value > > and if there was an -EINTR, then we would re-execute the system > > call. This works again, because we are using wrappers around many > > (but not all) of the system calls. > > I'm probably missing something but can't you stop the application > using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry Wouldn't checkpoint and gdb interfere then since the kernel only allows one task to attach? So if DMTCP is checkpointing something and uses this solution then you can't debug it. If a user is debugging their process then DMTCP can't checkpoint it. 
> about -EINTR failures (there are some exceptions but nothing really to > worry about). Also, unless the manager thread needs to be always > online, you can inject manager thread by manipulating the target > process states while taking a snapshot. Ugh. Frankly it sounds like we're being asked to pin our hopes on a house of cards -- weird userspace hacks involving extra processes, hodge-podge combinations of ptrace, LD_PRELOAD, signal hijacking, brk hacks, scanning passes in /proc (possibly at numerous times which begs for races), etc. When all is said and done, my suspicion is all of it will be a mess that shows races which none of the [added] kernel interfaces can fix. In contrast, kernel-based cr is rather straightforward when you bother to read the patches. It doesn't require using combinations of obscure userspace interfaces to intercept and emulate those very same interfaces. It doesn't add a scattered set of new ABIs. And any races would be in a syscall where they could likely be fixed without adding yet-more ABIs all over the place. > > But since you ask :-), there is one thing on our wish list. We > > handle address space randomization, vdso, vsyscall, and so on quite > > well. We do not turn off address space randomization (although on > > restart, we map user segments back to their original addresses). > > Probably the randomized value of brk (end-of-data or end of heap) is > > the thing that gave us the most troubles and that's where the code > > is the most hairy. > > Can you please elaborate a bit? What do you want to see changed? > > > The implementation is reasonably modularized. In the rush to > > address bugs or feature requirements of users, we sometimes cut > > corners. We intend to go back and fix those things. Roughly, the > > architecture of DMTCP is to do things in two layers: MTCP handles a > > single multi-threaded process. There is a separate library mtcp.so.
> > The higher layer (redundantly again called DMTCP) is implemented in > > dmtcphijack.so. In a _very_ rough kind of way, MTCP does a lot of > > what would be done within kernel C/R. But the higher DMTCP layer > > takes on some of those responsibilities in places. For example, > > DMTCP does part of analyzing the pseudo-ttys, since it's not always > > easy to ensure that it's the controlling terminal of some process > > that can checkpoint things in the MTCP layer. > > > > Beyond that, the wrappers around system calls are essentially > > perfectly modular. Some system calls go together to support a > > single kernel feature, and those wrappers are kept in a common file. > > I see. I just thought that it would be helpful to have the core part > - which does per-process checkpointing and restoring and corresponds > to the features implemented by in-kernel CR - as a separate thing. It > already sounds like that is mostly the case. > > I don't have much idea about the scope of the whole thing, so please > feel free to hammer senses into me if I go off track. From what I > read, it seems like once the target process is stopped, dmtcp is able > to get most information necessary from kernel via /proc and other > methods but the paper says that it needs to intercept socket related > calls to gather enough information to recreate them later. I'm > curious what's missing from the current /proc. You can map socket to > inode from /proc/*/fd which can be matched to an entry in > /proc/*/net/PROTO to find out the addresses and most socket options > should be readable via getsockopt. Am I missing something? > > I think this is why userland CR implementation makes much more sense. One foreseeable future is nested containers. How will this house of cards work if we wish to checkpoint a container that is itself performing a checkpoint? We've thought about the nested container case and designed our interfaces so that they won't change for that case.
What happens if any of these new interfaces get used for non-checkpoint purposes and then we wish to checkpoint those tasks? Will we need any more interfaces for that? We definitely don't want to wind up with an ABI that looks like a Russian Doll. > Most of states visible to a userland process are rather rigidly > defined by standards and, ultimately, ABI and the kernel exports most > of those information to userland one way or the other. Given the > right set of needed features, most of which are probabaly already > implemented, a userland implementation should have access to most > information necessary to checkpoint without resorting to too messy So you agree it will be a mess (just not "too messy"). I have no idea what you think "too messy" is, but given all the stuff proposed so far I'd say you've reached that point already. > methods and then there inevitably needs to be some workarounds to make > CR'd processes behave properly w.r.t. other states on the system, so > userland workarounds are inevitable anyway unless it resorts to > preemtive separation using namespaces and containers, which I frankly Huh? I am not sure what you mean by "preemptive separation using namespaces and containers". Cheers, -Matt Helsley ^ permalink raw reply [flat|nested] 111+ messages in thread
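For reference, the link/open/rm sequence in the puzzle above can be written as a short program. This is only a sketch: it assumes "link a to b" means a hard link (the message does not say which kind), it leaves out the checkpoint/restart step itself, and the function name unlinked_file_puzzle is made up for illustration. It shows the behaviour a restarted task would still have to see, namely that a write through b is visible through the descriptor opened on a before a was removed.

```python
import os
import tempfile


def unlinked_file_puzzle(workdir):
    """Run the link/open/rm/open/write/read sequence; return what 'a' reads."""
    a = os.path.join(workdir, "a")
    b = os.path.join(workdir, "b")

    # Create 'a', then link a to b (assumption: a hard link, not a symlink).
    with open(a, "w"):
        pass
    os.link(a, b)

    fa = open(a, "rb")   # open a (stays open)
    os.unlink(a)         # rm a: the name goes away, the inode stays alive

    # <checkpoint and restart> would happen at this point.

    with open(b, "wb") as fb:   # open b, write to b
        fb.write(b"hello")

    data = fa.read()     # read from a (the write must appear)
    fa.close()
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(unlinked_file_puzzle(d))
```

After a restart, the descriptor for a and the name b have to refer to the same file again, even though a no longer has a path of its own; restoring a as a private copy of the data would break the final read.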
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-06 5:32 ` Matt Helsley @ 2010-11-06 15:01 ` Oren Laadan 2010-11-06 20:40 ` Gene Cooperman 1 sibling, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-06 15:01 UTC (permalink / raw) To: Matt Helsley Cc: Tejun Heo, Gene Cooperman, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch On 11/06/2010 01:32 AM, Matt Helsley wrote: > On Fri, Nov 05, 2010 at 10:28:09AM +0100, Tejun Heo wrote: >> Hello, >> >> On 11/04/2010 05:44 PM, Gene Cooperman wrote: >>>>> In our personal view, a key difference between in-kernel and userland >>>>> approaches is the issue of security. >>>> >>>> That's an interesting point but I don't think it's a dealbreaker. >>>> ... but it's not like CR is gonna be deployed on >>>> majority of desktops and servers (if so, let's talk about it then). >>> >>> This is a good point to clarify some issues. C/R has several good >>> targets. For example, BLCR has targeted HPC batch facilities, and >>> does it well. >>> >>> DMTCP started life on the desktop, and it's still a primary focus of >>> DMTCP. We worked to support screen on this release precisely so >>> that advanced desktop users have the option of putting their whole >>> screen session under checkpoint control. It complements the core >>> goal of screen: If you walk away from a terminal, you can get back >>> the session elsewhere. If your session crashes, you can get back >>> the session elsewhere (depending on where you save the checkpoint >>> files, of course :-) ). >> >> Call me skeptical but I still don't see, yet, it being a mainstream >> thing (for average sysadmin John and proverbial aunt Tilly). It >> definitely is useful for many different use cases tho. Hey, but let's >> see. > > Rightly so. It hasn't been widely proven as something that distros > would be willing to integrate into a normal desktop session. We've got > some demos of it working with VNC, twm, and vim. Oren has his own VNC, > twm, etc demos too. 
We haven't looked very closely at more advanced > desktop sessions like (in no particular order) KDE or Gnome. Nor have > we yet looked at working with any portions of X that were meant to provide > this but were never popular enough to do so (XSMP iirc). Actually, I do have a demo of Zap (linux-cr predecessor) with a _full_ gnome desktop running under VNC with: * a movie player, * firefox, * thunderbird, * openoffice, * kernel make, * gdb debugging something, * WINE with microsoft office (oops) all of these checkpointed with < 25ms of downtime and resumed an arbitrary time later, successfully. I even have witnesses that saw it ;) > > Does DMTCP handle KDE/Gnome sessions? X too? > > On the kernel side of things for the desktop, right now we think our > biggest obstacle is inotify. I've been working on kernel patches for > kernel-cr to do that and it seems fairly do-able. Does DMTCP handle > restarting inotify watches without dropping events that were present > during checkpoint? > At the very least userspace would need to interpose on all inotify related syscalls to track (log) what the user did to be able to redo it at restart. (And I'm sure there will be crazy to impossible races and corner cases there). Does it make sense to replicate in userspace everything already done in the kernel? > The other problem for kernel c/r of X is likely to be DRM. Since the > different graphics chipsets vary so widely there's nothing we can do > to migrate DRM state of an NVIDIA chipset to DRM state of an ATI chipset > as far as I know. Perhaps if that would help hybrid graphics systems > then it's something that could be common between DRM and > checkpoint/restart but it's very much pie-in-the-sky at the moment. DRM is hardware, and is complex for both userspace and kernel. Let's assume it isn't supported until it's properly virtualized. (In the long-long run, I'd envision hardware manufacturers providing c/r support within their drivers - e.g.
checkpoint() and restart() kernel methods. But that's only if they care about it, and in any event, pretty far down the road...) > kernel c/r of input devices might be alot easier. We just simulate > hot [un]plug of the devices and rely on X responding. We can even > checkpoint the events X would have missed and deliver them prior to hot > unplug. > [snip] Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-06 5:32 ` Matt Helsley 2010-11-06 15:01 ` Oren Laadan @ 2010-11-06 20:40 ` Gene Cooperman 2010-11-06 22:41 ` Oren Laadan 2010-11-07 21:44 ` Oren Laadan 1 sibling, 2 replies; 111+ messages in thread From: Gene Cooperman @ 2010-11-06 20:40 UTC (permalink / raw) To: Matt Helsley Cc: Tejun Heo, Gene Cooperman, Kapil Arya, Oren Laadan, ksummit-2010-discuss, linux-kernel, hch By the way, Oren, Kapil and I are hoping to find time in the next few days to talk offline. Apparently the Linux C/R and DMTCP had continued for some years unaware of each other. We appreciate that a huge amount of work has gone into both of the approaches, and so we'd like to reap the benefit of the experiences of the two approaches. We're still learning more about each other's approaches. Below, I'll try to answer as best I can the questions that Matt brings up. Since Matt brings up _lots_ of questions, and I add my own topics, I thought it best to add a table of contents to this e-mail. For each topic, you'll see a discussion inline below. 1. Distros, checkpointing a desktop, KDE/Gnome, X [ Trying to answer Matt's question ] 2. Directly checkpointing a single X11 app [ Our own preferred approach, as opposed to checkpointing an entire desktop; This is easy, but we just haven't had the time lately. I estimate the time to do it is about one person working straight out for two weeks or so. But who has that much spare time? :-) ] 3. OpenGL [ Checkpointing OpenGL would be a really big win. We don't know the right way, but we're looking. Do you have some thoughts on that? Thanks.] 4. inotify and NSCD [ We try to virtualize a single app, instead of also checkpointing inotify and NSCD themselves. It would have been interesting to consider checkpointing them in userland, but that would require root privilege, and one core design principle we have is that all of our C/R is completely unprivileged.
So, we would see distributing DMTCP as a package in a distro, and letting individual users decide for what computation they might want to use it. ] 5. Checkpointing DRM state and other graphics chip state [ It comes down to virtualization around a single app versus checkpointing _all_ of X. --- Two different approaches. ] 6. kernel c/r of input devices might be alot easier [ We agree with you. By virtualizing around a single app, we hope to avoid this issue. ] 7. C/R for link/open/rm/open/write/read puzzle 8. What happens if the DMTCP coordinator ( checkpoint control process) dies? [ The same thing that happens if a user process dies. We kill the whole computation, and restart. At restart, we use a new coordinator. Coordinators are stateless. ] 9. We try to hide the reserved signal (SIGUSR2 by default) ... [ Matt says this is a mess, but we note that glibc does this too. ] 10. checkpoint, gdb and PTRACE_ATTACH [ DMTCP does not use PTRACE_ATTACH in its implementation. So, we can and do fully support user processes that use PTRACE_ATTACH. ] 11. DMTCP, ABIs, can there be a race condition between the ckpt thread and user threads of an app? [ DMTCP doesn't introduce any new ABIs. There may be a misconception here. If we can talk at length off-line, I could explain more about the DMTCP design. Inline, I explain why race conditions should not be an issue. ] 12. nested containers, ABIs, etc. [ see inline comment ] 13. a userland implementation should have access to most information necessary to checkpoint without resorting to too messy [ In fact, the primary ABIs that we use outside of system calls are /proc/*/maps and /proc/*/fd. Even here, we would have workarounds if someone took those ABIs away. ] The full range of comments is inline below. Sorry that this e-mail is getting so long. There are many things to talk about. I hope to later take advantage of the higher bandwidth with Oren (by phone) to thrash out some of these things together. 
Thanks, - Gene On Fri, Nov 05, 2010 at 10:32:04PM -0700, Matt Helsley wrote: > On Fri, Nov 05, 2010 at 10:28:09AM +0100, Tejun Heo wrote: > > Hello, > > > > On 11/04/2010 05:44 PM, Gene Cooperman wrote: > > >>> In our personal view, a key difference between in-kernel and userland > > >>> approaches is the issue of security. > > >> > > >> That's an interesting point but I don't think it's a dealbreaker. > > >> ... but it's not like CR is gonna be deployed on > > >> majority of desktops and servers (if so, let's talk about it then). > > > > > > This is a good point to clarify some issues. C/R has several good > > > targets. For example, BLCR has targeted HPC batch facilities, and > > > does it well. > > > > > > DMTCP started life on the desktop, and it's still a primary focus of > > > DMTCP. We worked to support screen on this release precisely so > > > that advanced desktop users have the option of putting their whole > > > screen session under checkpoint control. It complements the core > > > goal of screen: If you walk away from a terminal, you can get back > > > the session elsewhere. If your session crashes, you can get back > > > the session elsewhere (depending on where you save the checkpoint > > > files, of course :-) ). > > > > Call me skeptical but I still don't see, yet, it being a mainstream > > thing (for average sysadmin John and proverbial aunt Tilly). It > > definitely is useful for many different use cases tho. Hey, but let's > > see. > > Rightly so. It hasn't been widely proven as something that distros > would be willing to integrate into a normal desktop session. We've got > some demos of it working with VNC, twm, and vim. Oren has his own VNC, > twm, etc demos too. We haven't looked very closely at more advanced > desktop sessions like (in no particular order) KDE or Gnome. Nor have > we yet looked at working with any portions of X that were meant to provide > this but were never popular enough to do so (XSMP iirc). 
> > Does DMTCP handle KDE/Gnome sessions? X too? 1. Distros, checkpointing a desktop, KDE/Gnome, X DMTCP does checkpoint VNC sessions with a desktop, KDE/Gnome, and X. We were doing that in some joint work with SCIRun: http://www.sci.utah.edu/cibc/software/106-scirun.html SCIRun only works under X, and so it was an absolute prerequisite. SCIRun optionally also likes to use OpenGL (3-D graphics). We had hacked up something for OpenGL 1.5, and I write more on that, below. However, we agree with you that a distro would probably not want to run C/R under their regular X session. If anything minor fails, it hurts their reputation, which is everything for them. So, we think that's a non-starter. The other possibility is to use C/R on a VNC session for an X desktop. We also think that most users would not care for the extra complication of having two desktops (one under checkpoint control, and the main one). One can run an individual X11 application under VNC and checkpoint the VNC. We can and _do_ do that. But it's still unsatisfying for us. The heaviness and added complexity of checkpointing a VNC server makes us nervous. 2. Directly checkpointing a single X11 app So, as I said in a different post, we're planning to virtualize directly around libX11.so and libxcb.so. Then we'll checkpoint the X11 graphic application and _only_ the X11 graphic application. We think that a really cool advantage of this approach is that if you checkpoint the X11 app under Gnome, then you can bring it back to life under KDE, and it will now have the look-and-feel of KDE. Another advantage of this approach is that there's a single desktop shared by all applications. If the X11 application wishes to use dbus, a window manager, or whatever, to communicate with other X11 apps, it can continue to do so. Our virtualization approach should work well when interaction goes through a small enough library around which we can place wrappers.
The library can be libc.so, libX11.so, or any of many other libraries. This also seems more modular to us. A VNC server has to worry about _lots_ of things, and we only need the connect/disconnect portion of the VNC server. It's not hard to implement that directly in a small library. Also, if we checkpoint fewer processes, the time to write to disk is smaller. 3. OpenGL We had hacked up something for OpenGL 1.5 with the intention of supporting SCIRun. It was based on the work of: http://andres.lagarcavilla.com/publications/LagarCavillaVEE07.pdf http://andres.lagarcavilla.com/vmgl/index.html The problem was that OpenGL is growing and adding system calls faster than one can virtualize them. :-) We didn't want to always be chasing around to support the newest addition to OpenGL. Have you also looked at checkpointing OpenGL? It's an interesting question. Unfortunately, I doubt that the vendors will support C/R in their video drivers, and so we're forced to look for a different solution (or give up, and we don't like giving up :-) ). > On the kernel side of things for the desktop, right now we think our > biggest obstacle is inotify. I've been working on kernel patches for > kernel-cr to do that and it seems fairly do-able. Does DMTCP handle > restarting inotify watches without dropping events that were present > during checkpoint? 4. inotify and NSCD We have run into inotify. We don't try to checkpoint inotify itself. Instead, as with X11 apps, our larger interest is in checkpointing a single computation that might have been interacting with inotify, and then be able to restart the single app and resume talking with inotify. The situation is similar to that with NSCD (Network Services Caching Daemon). If you wish to checkpoint a single application, and if it was talking to the Network Services Caching Daemon, how do you handle that? Is it that you always checkpoint both the app and the NSCD at the same time? 
If so, perhaps this is a key difference in the two approaches: virtualize around a single app; or checkpoint _every_ process that is interacting with the process of interest. But I'm just speculating, and I need to talk more with you all to understand better. > The other problem for kernel c/r of X is likely to be DRM. Since the > different graphics chipsets vary so widely there's nothing we can do > to migrate DRM state of an NVIDIA chipset to DRM state of an ATI chipset > as far as I know. Perhaps if that would help hybrid graphics systems > then it's something that could be common between DRM and > checkpoint/restart but it's very much pie-in-the-sky at the moment. 5. Checkpointing DRM state and other graphics chip state Again, this may come down to virtualization around a single application versus checkpointing everything. We would try to avoid the necessity of checkpointing graphics drivers, DRM issues, etc., through virtualization. As I wrote above, though, we don't yet have a good virtualization solution when it comes to OpenGL. So, we're very interested in any thoughts you have about handling OpenGL. > kernel c/r of input devices might be alot easier. We just simulate > hot [un]plug of the devices and rely on X responding. We can even > checkpoint the events X would have missed and deliver them prior to hot > unplug. 6. kernel c/r of input devices might be alot easier I think I would agree. As indicated above, our philosophy is to virtualize the single app, instead of "checkpointing the world", as one of our team, Jason Ansel, used to like to say. :-) But this is not to say that checkpointing the entire X with input devices isn't also interesting. The two works are complementary. > Also, how does DMTCP handle unlinked files? They are important because > lots of process open a file in /tmp and then unlink it. And that's not > even the most difficult case to deal with.
How does DMTCP handle: > > link a to b > open a (stays open) > rm a > <checkpoint and restart> > open b > write to b > read from a (the write must appear) > > ? 7. C/R for link/open/rm/open/write/read puzzle We did have some similar issues like this in some of the apps we looked at. For example, if my memory is right, in an app that works with the NSCD daemon, it mmaps a shared file, and then unlinks the file so that the file will be deleted when the app exits. Just to make sure that everything is precise, would you mind writing a short app like that and sending it to us? For example, I'm guessing the link is a symbolic link, but the actual code will make it all precise. We'll directly perform the experiment you propose and tell you the result. I think the short story will be that we have a command-line option by which the user specifies if they would like to checkpoint open files. We also have heuristics to try to do the right thing when the user didn't give us specific instructions on the command line. The short answer is that we're driven by the use cases we encounter, and we think of application coverage. You may be right that we don't currently cover this, but I would like to try it first, and verify. If you have an important use case for this scenario, we will definitely add coverage for it. Maybe this is another difference in philosophy. Oren talked about full transparency --- meaning that the kernel will always present the illusion of continuity to an app. Because we know the design of DMTCP, we know of ways that a userland app could create weird cases where the wrong things happen. When we discover an app that needs the weird case, we expand our coverage through additional virtualization. > > > These are also some excellent points for discussion! The manager thread > > > is visible.
For example, if you run a gdb session under checkpoint > > > control (only available in our unstable branch, currently), then > > > the gdb session will indeed see the checkpoint manager thread. > > > > I don't think gdb seeing it is a big deal as long as it's hidden from > > the application itself. > > Is the checkpoint control process hidden from the application? What > happens if it gets killed or dies in the middle of checkpoint? Can > a malicious task being checkpointed (perhaps for later analysis) > kill it? Or perhaps it runs as root or a user with special capabilities? 8. What happens if the DMTCP coordinator ( checkpoint control process) dies If the checkpoint control process dies, then the checkpoint manager thread in the user app never hears from the coordinator again. The application continues anyway without failing. But, it's no longer possible to checkpoint that application. Again, I think it's a difference in philosophy. We want to checkpoint a single app or computation. If that computation loses _any_ of its processes (whether it's the DMTCP coordinator process or one of the application processes itself), then it's best to kill the computation and restart from the last checkpoint image. Our DMTCP coordinator is stateless, and so it's no problem to create a new DMTCP coordinator at the time of restart. > > > We try to hid the reserved signal (SIGUSR2 by default, but the user > Mess. 9. We try to hide the reserved signal (SIGUSR2 by default Beauty is in the eye of the beholder. :-) I remind you that libc reserves SIGRTMIN and SIGRTMIN + 1 for thread cancellation and for setxid, respectively. If reserving a signal is bad, then libc.so is also a "Mess". In the glibc source, look at: ./nptl/pthreadP.h: #define SIGCANCEL __SIGRTMIN ./nptl/pthreadP.h: #define SIGSETXID (__SIGRTMIN + 1) Probably glibc is even worse than us. They use the signal, and they _don't_ hide it from the user. Userland is a messy place. :-) > > > can configure it to anything else).
We put wrappers around system > > > calls that might see our signal handler, but I'm sure there are > > > cases where we might not succeed --- and so a skilled user would > > > have to configure to use a different signal handler. And of course, > > > there is the rare application that repeatedly resets _every_ signal. > > > We encountered this in an earlier version of Maple, and the Maple > > > developers worked with us to open up a hole so that we could > > > checkpoint Maple in future versions. > > > > > >> [while] all programs should be ready to handle -EINTR failure from system > > >> calls, it's something which is very difficult to verify and test and > > >> could lead to once-in-a-blue-moon head scratchy kind of failures. > > > > > > Exactly right! Excellent point. Perhaps this gets down to > > > philosophy, and what is the nature of a bug. :-) In some cases, we > > > have encountered this issue. Our solution was either to refuse to > > > checkpoint within certain system calls, or to check the return value > > > and if there was an -EINTR, then we would re-execute the system > > > call. This works again, because we are using wrappers around many > > > (but not all) of the system calls. > > > > I'm probably missing something but can't you stop the application > > using PTRACE_ATTACH? You wouldn't need to hijack a signal or worry > > Wouldn't checkpoint and gdb interfere then since the kernel only allows > one task to attach? So if DMTCP is checkpointing something and uses this > solution then you can't debug it. If a user is debugging their process then > DMTCP can't checkpoint it. 10. checkpoint, gdb and PTRACE_ATTACH As a design decision, DMTCP never traces a process. We did this so we could easily checkpoint a gdb session without worrying about gdb and DMTCP both trying to trace the gdb target process. > > about -EINTR failures (there are some exceptions but nothing really to > > worry about). 
Also, unless the manager thread needs to be always > > online, you can inject manager thread by manipulating the target > > process states while taking a snapshot. > > Ugh. Frankly it sounds like we're being asked to pin our hopes on > a house of cards -- weird userspace hacks involving extra > processes, hodge-podge combinations of ptrace, LD_PRELOAD, signal > hijacking, brk hacks, scanning passes in /proc (possibly at numerous > times which begs for races), etc. > > When all is said and done, my suspicion is all of it will be a mess > that shows races which none of the [added] kernel interfaces can fix. > > In contrast, kernel-based cr is rather straight forward when you bother > to read the patches. It doesn't require using combinations of obscure > userspace interfaces to intercept and emulate those very same interfaces. > It doesn't add a scattered set of new ABIs. And any races would be in a > a syscall where they could likely be fixed without adding yet-more ABIs > all over the place. 11. DMTCP, ABIs, can there be a race condition between the ckpt thread and user threads of an app? DMTCP does not add any new ABIs. But maybe I misunderstood your point. The only potential races I can see are between the checkpoint thread and the user threads. But the checkpoint thread does nothing except listen for a command from the coordinator. When the command comes, it first quiesces the user threads, before doing anything. All of those wrappers for virtualization that we refer to are executed by the ordinary _user_ threads. The checkpoint thread is in a select system call during that entire time. > > > But since you ask :-), there is one thing on our wish list. We > > > handle address space randomization, vdso, vsyscall, and so on quite > > > well. We do not turn off address space randomization (although on > > > restart, we map user segments back to their original addresses). 
> > > Probably the randomized value of brk (end-of-data or end of heap) is > > > the thing that gave us the most troubles and that's where the code > > > is the most hairy. > > > > Can you please elaborate a bit? What do you want to see changed? > > > > > The implementation is reasonably modularized. In the rush to > > > address bugs or feature requirements of users, we sometimes cut > > > corners. We intend to go back and fix those things. Roughly, the > > > architecture of DMTCP is to do things in two layers: MTCP handles a > > > single multi-threaded process. There is a separate library mtcp.so. > > > The higher layer (redundantly again called DMTCP) is implemented in > > > dmtcphijack.so. In a _very_ rough kind of way, MTCP does a lot of > > > what would be done within kernel C/R. But the higher DMTCP layer > > > takes on some of those responsibilities in places. For example, > > > DMTCP does part of analyzing the pseudo-ttys, since it's not always > > > easy to ensure that it's the controlling terminal of some process > > > that can checkpoint things in the MTCP layer. > > > > > > Beyond that, the wrappers around system calls are essentially > > > perfectly modular. Some system calls go together to support a > > > single kernel feature, and those wrappers are kept in a common file. > > > > I see. I just thought that it would be helpful to have the core part > > - which does per-process checkpointing and restoring and corresponds > > to the features implemented by in-kernel CR - as a separate thing. It > > already sounds like that is mostly the case. > > > > I don't have much idea about the scope of the whole thing, so please > > feel free to hammer senses into me if I go off track. 
From what I > > read, it seems like once the target process is stopped, dmtcp is able > > to get most information necessary from kernel via /proc and other > > methods but the paper says that it needs to intercept socket related > > calls to gather enough information to recreate them later. I'm > > curious what's missing from the current /proc. You can map socket to > > inode from /proc/*/fd which can be matched to an entry in > > /proc/*/net/PROTO to find out the addresses and most socket options > > should be readable via getsockopt. Am I missing something? > > > > I think this is why userland CR implementation makes much more sense. > > One foreseeable future is nested containers. How will this house of cards > work if we wish to checkpoint a container that is itself performing a > checkpoint? We've thought about the nested container case and designed > our interfaces so that they won't change for that case. > > What happens if any of these new interfaces get used for non-checkpoint > purposes and then we wish to checkpoint those tasks? Will we need any > more interfaces for that? We definitely don't want to wind up with an > ABI that looks like a Russian Doll. 12. nested containers, ABIs, etc. I think we would need to elaborate with individual cases. But as I wrote above, DMTCP and Linux C/R started with two different philosophies. I'm not sure if you fully understood the DMTCP goals and philosophy yet, but I hope my comments above help clarify it. > > Most of states visible to a userland process are rather rigidly > > defined by standards and, ultimately, ABI and the kernel exports most > > of those information to userland one way or the other. Given the > > right set of needed features, most of which are probably already > > implemented, a userland implementation should have access to most > > information necessary to checkpoint without resorting to too messy > > So you agree it will be a mess (Just not "too messy").
I have no > idea what you think "too messy" is, but given all the stuff proposed > so far I'd say you've reached that point already. 13. a userland implementation should have access to most information necessary to checkpoint without resorting to too messy If it helps, DMTCP began with Linux 2.6.3, and we continue to support Linux 2.6.9. In fact, DMTCP seems to uncover a bug in Linux 2.6.9 and maybe in Linux 2.6.18, or perhaps in the NFS implementation on top of it. We've experienced some reproducible O/S instability when doing C/R in certain of those environments. :-) But we mostly use newer kernels now, where the reliability is truly excellent. Anyway, I suspect most of these ABIs and kernel exports that you mention did not exist in Linux 2.6.9. We don't depend on them. The ABIs that we use outside of system calls are: /proc/*/maps /proc/*/fd If those ABIs were taken away, we have other ways to virtualize and get the information that we need. > > methods and then there inevitably needs to be some workarounds to make > > CR'd processes behave properly w.r.t. other states on the system, so > > userland workarounds are inevitable anyway unless it resorts to > > preemptive separation using namespaces and containers, which I frankly > > Huh? I am not sure what you mean by "preemptive separation using > namespaces and containers". > > Cheers, > -Matt Helsley ^ permalink raw reply [flat|nested] 111+ messages in thread
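Tejun's suggestion quoted earlier in this message (map a socket to its inode through /proc/*/fd, then match that inode against an entry in /proc/*/net/PROTO) can be sketched as follows. This is only a minimal illustration in Python, not DMTCP or linux-cr code. The function names are made up for the sketch, only IPv4 TCP is handled, and the layout of /proc/net/tcp (inode in the tenth whitespace-separated field, addresses printed as native-endian hex) is an assumption based on current Linux kernels.

```python
import os
import re
import socket
import struct

SOCKET_LINK = re.compile(r"socket:\[(\d+)\]")

def socket_inodes(pid="self"):
    """Return {fd: inode} for every socket fd listed in /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    found = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed while we were scanning
        match = SOCKET_LINK.fullmatch(target)
        if match:
            found[int(name)] = int(match.group(1))
    return found

def decode_ipv4(field):
    """Decode one 'HEXADDR:HEXPORT' field as printed in /proc/net/tcp."""
    addr_hex, port_hex = field.split(":")
    # Assumption: the kernel prints the address as a native-endian 32-bit int.
    addr = socket.inet_ntoa(struct.pack("=I", int(addr_hex, 16)))
    return addr, int(port_hex, 16)

def parse_tcp_line(line):
    """Return (inode, local, remote) for one data row of /proc/net/tcp."""
    fields = line.split()
    return int(fields[9]), decode_ipv4(fields[1]), decode_ipv4(fields[2])

def tcp_endpoints(pid="self"):
    """Return {inode: (local, remote)} from /proc/<pid>/net/tcp (IPv4 only)."""
    table = {}
    with open(f"/proc/{pid}/net/tcp") as f:
        next(f)  # skip the column header
        for line in f:
            inode, local, remote = parse_tcp_line(line)
            table[inode] = (local, remote)
    return table
```

Joining socket_inodes() and tcp_endpoints() on the inode gives the local and remote address of each TCP socket fd of a process. Socket options are a separate matter: getsockopt() needs the descriptor itself, so that part would have to run inside the target process.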
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-06 20:40 ` Gene Cooperman @ 2010-11-06 22:41 ` Oren Laadan 2010-11-07 18:49 ` Gene Cooperman 2010-11-07 21:44 ` Oren Laadan 1 sibling, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-06 22:41 UTC (permalink / raw) To: Gene Cooperman Cc: Matt Helsley, Tejun Heo, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch On 11/06/2010 04:40 PM, Gene Cooperman wrote: > By the way, Oren, Kapil and I are hoping to find time in the next few > days to talk offline. Apparently the Linux C/R and DMTCP had continued That was my understanding too. However, I also felt that I'd better clarify a key point first. > for some years unaware of each other. We appreciate that a huge amount > of work has gone into both of the approaches, and so we'd like to reap > the benefit of the experiences of the two approaches. We're still learning > more about each other's approaches. Below, I'll try to answer as best > I can the questions that Matt brings up. Since Matt brings up _lots_ > of questions, and I add my own topics, I thought it best to add a table > of contents to this e-mail. For each topic, you'll see a discussion > inline below. [snip] > 2. Directly checkpointing a single X11 app > [ Our own preferred approach, as opposed to checkpointing an entire desktop; > This is easy, but we just haven't had the time lately. I estimate > the time to do it is about one person working straight out for two weeks > or so. But who has that much spare time. :-) ] Hmmm... that sounds pretty fast .. given that you will need to save and reconstruct an arbitrary state kept by the X server... More importantly, this line of thought was brought up in this thread multiple times, yet in a very misleading way. The question is _not_ whether one can do c/r of single apps without their surrounding environment.
The answer for that is simple: it _is_ possible either using proper (and more likely per-app) wrappers, or by adapting the apps to tolerate that. The above is entirely orthogonal to whether the c/r is in kernel or in userspace. So for terminal based apps, one can use 'screen'. For individual X apps, one can use a light VNC server with proper embedding in the desktop (e.g. metavnc). Or you could use screen-for-X like 'xpra'. Or you can write wrappers (messy or hairy or not) that will try to do that, or you could modify the apps. IIUC, dmtcp chose the way of the wrappers. But that is independent of where you do c/r ! The issue on the table is whether the _core_ c/r should go in kernel or userspace. Those wrappers of dmtcp are great and will be useful with either approach. So let us please _not_ argue that only one approach can c/r apps or processes out of their context. That is inaccurate and misleading. And while one may argue that one use-case is more important than another, let us also _not_ dismiss such use cases (as was argued by others in this thread). For example, c/r of a full desktop session in VNC, or a VPS, is a perfectly valid and useful case. [snip] > 4. inotify and NSCD > [ We try to virtualize a single app, instead of also checkpointing > inotify and NSCD themselves. It would have been interesting to consider > checkpointing them in userland, but that would require root privilege, > and one core design principle we have, is that all of our C/R is > completely unprivileged. So, we would see distributing DMTCP as > a package in a distro, and letting individual users decide for > what computation they might want to use it. ] FYI, inotify() is a syscall and does not require root privileges. It's a kernel API used to get notifications of changes to file system inodes. for instance, it's commonly used by file managers (e.g. nautilus). > > 5. 
Checkpointing DRM state and other graphics chip state > [ It comes down to virtualization around a single app versus checkpointing > _all_ of X. --- Two different approaches. ] > > 6. kernel c/r of input devices might be a lot easier > [ We agree with you. By virtualizing around a single app, we hope > to avoid this issue. ] Back to the point argued above, "virtualization around a single app" refers to the wrappers that allow one to take an app out of context and sort of implant it in another context. It's a very desirable feature, but orthogonal to the c/r technique. > > 7. C/R for link/open/rm/open/write/read puzzle > > 8. What happens if the DMTCP coordinator (checkpoint control process) dies? > [ The same thing that happens if a user process dies. We kill the whole > computation, and restart. At restart, we use a new coordinator. > Coordinators are stateless. ] > > 9. We try to hide the reserved signal (SIGUSR2 by default) ... > [ Matt says this is a mess, but we note that glibc does this too. ] > > 10. checkpoint, gdb and PTRACE_ATTACH > [ DMTCP does not use PTRACE_ATTACH in its implementation. So, we can > and do fully support user processes that use PTRACE_ATTACH. ] Hmm... can you really c/r from userspace a process that was, at checkpoint time, in a ptrace-stopped state at an arbitrary kernel ptrace-hook ? I strongly suspect the answer is "no", definitely not unless you also virtualize and replicate the entire in-kernel ptrace functionality in userspace. > > 11. DMTCP, ABIs, can there be a race condition between the ckpt thread and > user threads of an app? > [ DMTCP doesn't introduce any new ABIs. There may be a misconception here. > If we can talk at length off-line, I could explain more about > the DMTCP design. Inline, I explain why race conditions should > not be an issue. ] I beg to differ.
Virtualization that relies on a "black box" (in the sense that it works around an API but is not integrated into the API, like dmtcp does) has been shown time and again to be racy. The common term is TOCTTOU races. See "Traps and Pitfalls: Practical Problems in System Call Interposition Based Security Tools" for example (http://www.stanford.edu/~talg/papers/traps/abstract.html), and many others that cite (or not) this work. I believe the way dmtcp virtualizes the pid-namespace is no exception to this rule. [snip] > > I think we would need to elaborate with individual cases. But as I wrote > above, DMTCP and Linux C/R started with two different philosophies. > I'm not sure if you fully understood the DMTCP goals and philosophy yet, > but I hope my comments above help clarify it. Yes, let's look into the goals: dmtcp aims to provide c/r for a certain class of applications and environments. For this dmtcp offers: (1) userspace c/r engine and c/r-oriented virtualization, and (2) userspace (often per-application or per-environment) wrappers. linux-cr provides (3) generic, transparent kernel-based c/r engine (yes, transparent! without userspace virtualization, LD_PRELOAD tricks, or collaboration of the developer/application/user). So let's compare apples to apples - let's compare (3) to (1). All of the work related to item (2) applies to and benefits from either. (Now looking forward to discussing more details with the dmtcp team on Tuesday and on :) Thanks, Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
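The TOCTTOU (time-of-check-to-time-of-use) pattern mentioned in the message above can be shown with a generic file example. This is only an illustration of the class of race, written in Python with hypothetical function names. It is not taken from dmtcp, and it makes no claim about whether dmtcp's pid virtualization is actually affected. The hook argument stands in for a concurrent thread that happens to run between the check and the use.

```python
import os
import stat
import tempfile

def racy_read(path, hook=None):
    """Check a path by name, then use it by name: the TOCTTOU window."""
    if os.path.islink(path) or not os.path.isfile(path):
        raise PermissionError("not a plain file")
    if hook:
        hook()                    # a concurrent actor changes the name here
    with open(path) as f:         # the name is resolved again: may differ
        return f.read()

def safer_read(path, hook=None):
    """Open first, then check the object the descriptor refers to."""
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        if not stat.S_ISREG(os.fstat(fd).st_mode):
            raise PermissionError("not a plain file")
        if hook:
            hook()                # renaming no longer affects what fd means
        return os.read(fd, 4096).decode()
    finally:
        os.close(fd)

def demo():
    """Run both readers against the same swap and return what each saw."""
    with tempfile.TemporaryDirectory() as d:
        secret = os.path.join(d, "secret")
        request = os.path.join(d, "request")
        with open(secret, "w") as f:
            f.write("secret data")

        def reset():
            if os.path.lexists(request):
                os.remove(request)
            with open(request, "w") as f:
                f.write("public data")

        def swap():
            # Replace the checked file with a symlink to another file.
            os.remove(request)
            os.symlink(secret, request)

        reset()
        racy = racy_read(request, swap)
        reset()
        safe = safer_read(request, swap)
        return racy, safe
```

In demo(), racy_read() passes its check and then reads through the swapped-in symlink, while safer_read() keeps reading the object it checked. A wrapper that inspects or translates a name in userspace before the kernel resolves it again is exposed to the first pattern unless the check and the use are tied to the same object.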
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-06 22:41 ` Oren Laadan @ 2010-11-07 18:49 ` Gene Cooperman 2010-11-07 21:59 ` Oren Laadan 0 siblings, 1 reply; 111+ messages in thread From: Gene Cooperman @ 2010-11-07 18:49 UTC (permalink / raw) To: Oren Laadan Cc: Gene Cooperman, Matt Helsley, Tejun Heo, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch > > 2. Directly checkpointing a single X11 app > > [ Our own preferred approach, as opposed to checkpointing an entire desktop; > > This is easy, but we just haven't had the time lately. I estimate > > the time to do it is about one person working straight out for two weeks > > or so. But who has that much spare time. :-) ] > > Hmmm... that sounds pretty fast .. given that you will need to > save and reconstruct an arbitrary state kept by the X server... > > More importantly, this line of thought was brought up in this > thread multiple times, yet in a very misleading way. > > The question is _not_ whether one can do c/r of single apps > without their surrounding environment. The answer for that is > simple: it _is_ possible either using proper (and more likely > per-app) wrappers, or by adapting the apps to tolerate that. > > The above is entirely orthogonal to whether the c/r is in kernel > or in userspace. These are all good points by Oren. It's not about in-kernel _or_ userland. There are opportunities to use both -- each where it is strongest, and I'm looking forward to that discussion with Oren. I do think that reconstructing the state of the X server is not as hard as Oren paints it, but let's talk about that in the discussion. > But that is independent of where you do c/r ! The issue on the > table is whether the _core_ c/r should go in kernel or userspace. > Those wrappers of dmtcp are great and will be useful with either > approach. > > So let us please _not_ argue that only one approach can c/r apps > or processes out of their context. That is inaccurate and misleading.
> > And while one may argue that one use-case is more important than > another, let us also _not_ dismiss such use cases (as was argued > by others in this thread). For example, c/r of a full desktop > session in VNC, or a VPS, is a perfectly valid and useful case. I agree. I apologize if I was too argumentative in the previous post. > FYI, inotify() is a syscall and does not require root privileges. It's > a kernel API used to get notifications of changes to file system inodes. > for instance, it's commonly used by file managers (e.g. nautilus). Yes, I know. I was writing too fast in trying to respond to all the points. Matt had asked how we would handle inotify(), but I was getting swamped by all the questions. There is a virtualization approach to inotify in which one puts wrappers around inotify_add_watch(), inotify_rm_watch() and friends in the same way as we wrap open() and could wrap close(). One would then need to wrap read() (which we don't like to do, just in case it could add significant overhead). But if we consider kernel and userland virtualization together, then something similar to TIOCSTI for ioctl would allow us to avoid wrapping read(). > Back to the point argued above, "virtualization around a single app" > are the wrappers that allow to take an app out of context and sort of > implant it in another context. It's a very desirable feature, but > orthogonal to the c/r technique. I agree. I look forward to the discussion where we can put all this into a single perspective. > Hmm... can you really c/r from userspace a process that was, at > checkpoint time, in a ptrace-stopped state at an arbitrary kernel > ptrace-hook ? I strongly suspect the answer is "no", definitely > not unless you also virtualize and replicate the entire in-kernel > ptrace functionality in userspace, Let's try it and see. If you write a program, we'll try it out in DMTCP (unstable branch) and see. So far, checkpointing gdb sessions has worked well for us. 
If there is something we don't cover, it will be helpful to both of us to find it, and analyze that case. > I beg to differ. Virtualization that relies on a "black box" (in > the sense that it works around an API but is not integrated into the > API, like dmtcp does) has been shown time and again to be racy. The > common term is TOCTTOU races. See "Traps and Pitfalls: Practical > Problems in System Call Interposition Based Security Tools" for > example (http://www.stanford.edu/~talg/papers/traps/abstract.html), > and many others that cite (or not) this work. > > I believe the way dmtcp virtualizes the pid-namespace is no > exception to this rule. Another excellent topic for discussion. I look forward to the discussion. Thanks for the advance pointer for us to take a look at. > Yes, let's look into the goals: > > dmtcp aims to provide c/r for a certain class of applications and > environments. For this dmtcp offers: > (1) userspace c/r engine and c/r-oriented virtualization, and > (2) userspace (often per-application or per-environment) wrappers. > > linux-cr provides (3) generic, transparent kernel-based c/r engine > (yes, transparent! without userspace virtualization, LD_PRELOAD > tricks, or collaboration of the developer/application/user). > > So let's compare apples to apples - let's compare (3) to (1). > All of the work related to item (2) applies to and benefits > from either. > > (Now looking forward to discussing more details with the dmtcp team on > Tuesday and on :) Also a very good point above, and I agree. The offline discussion should be a better forum for putting this all into perspective. Thanks again for your thoughtful response, - Gene ^ permalink raw reply [flat|nested] 111+ messages in thread
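The wrapper approach to inotify described in the message above (wrapping inotify_add_watch() and inotify_rm_watch() so that the watches can be re-created at restart) can be sketched roughly as follows. This is a hedged illustration in Python, not DMTCP's implementation: WatchLog, FakeInotify and all method names are hypothetical, and FakeInotify merely stands in for a kernel inotify instance so that the sketch runs without one. The sketch hands out virtual watch descriptors because a restarted process may receive different real descriptors from the kernel.

```python
class FakeInotify:
    """Stand-in for one kernel inotify instance (assumption, demo only)."""

    def __init__(self, first_wd):
        self.next_wd = first_wd
        self.active = {}              # real wd -> (path, mask)

    def add_watch(self, path, mask):
        wd = self.next_wd
        self.next_wd += 1
        self.active[wd] = (path, mask)
        return wd

    def rm_watch(self, wd):
        del self.active[wd]


class WatchLog:
    """Record watches made through the wrappers so they can be replayed."""

    def __init__(self, backend):
        self._backend = backend
        self._recorded = {}           # virtual wd -> (path, mask)
        self._to_real = {}            # virtual wd -> current real wd
        self._next_virtual = 1

    def add_watch(self, path, mask):
        """Wrapper: perform the real call, remember it, return a virtual wd."""
        real = self._backend.add_watch(path, mask)
        virtual = self._next_virtual
        self._next_virtual += 1
        self._recorded[virtual] = (path, mask)
        self._to_real[virtual] = real
        return virtual

    def rm_watch(self, virtual):
        """Wrapper: drop the real watch and forget the record."""
        self._backend.rm_watch(self._to_real.pop(virtual))
        del self._recorded[virtual]

    def replay(self, new_backend):
        """At restart: re-issue every recorded watch on a fresh instance."""
        self._backend = new_backend
        for virtual, (path, mask) in self._recorded.items():
            self._to_real[virtual] = new_backend.add_watch(path, mask)

    def to_virtual(self, real):
        """Translate a real wd (as seen in an event) back to the virtual wd."""
        for virtual, current in self._to_real.items():
            if current == real:
                return virtual
        raise KeyError(real)
```

One plausible reason a read() wrapper comes up in this design is the last method: events read from the inotify descriptor carry the real watch descriptor, which the application would not recognize after a restart unless something translates it back.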
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-07 18:49 ` Gene Cooperman @ 2010-11-07 21:59 ` Oren Laadan 2010-11-17 11:57 ` Tejun Heo 0 siblings, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-07 21:59 UTC (permalink / raw) To: Gene Cooperman Cc: Matt Helsley, Tejun Heo, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch, Linux Containers [cc'ing linux containers mailing list] On 11/07/2010 01:49 PM, Gene Cooperman wrote: [snip] > Matt had asked how we would handle inotify(), but I was getting swamped > by all the questions. There is a virtualization approach to inotify in which > one puts wrappers around inotify_add_watch(), inotify_rm_watch() and > friends in the same way as we wrap open() and could wrap close(). > One would then need to wrap read() (which we don't like to do, just This sounds like reimplementation in userspace the very same logic done by the kernel :) > in case it could add significant overhead). But if we consider kernel > and userland virtualization together, then something similar to TIOCSTI > for ioctl would allow us to avoid wrapping read(). We could work to add ABIs and APIs for each and every possible piece of state that affects userspace. And for each we'll argue forever about the design and some time later regret that it wasn't designed correctly :p Even if that happens (which is very unlikely and unnecessary), it will generate all the very same code in the kernel that Tejun has been complaining about, and _more_. And we will still suffer from issues such as lack of atomicity and being unable to do many simple and advanced optimizations. Or we could use linux-cr for that: do the c/r in the kernel, keep the know-how in the kernel, expose (and commit to) a per-kernel-version ABI (not vow to keep countless new individual ABIs forever after getting them wrongly...), be able to do all sorts of useful optimization and provide atomicity and guarantees (see under "leak detection" in the OLS linux-cr paper). 
Also, once the c/r infrastructure is in the kernel, it will be easy (and encouraged) to support newly introduced features. Finally, then we would use dmtcp as well as other tools on top of the kernel-cr - and I'm looking forward to doing that ! [snip] >> Hmm... can you really c/r from userspace a process that was, at >> checkpoint time, in a ptrace-stopped state at an arbitrary kernel >> ptrace-hook ? I strongly suspect the answer is "no", definitely >> not unless you also virtualize and replicate the entire in-kernel >> ptrace functionality in userspace. > > Let's try it and see. If you write a program, we'll try it out in > DMTCP (unstable branch) and see. So far, checkpointing gdb sessions > has worked well for us. If there is something we don't cover, it will > be helpful to both of us to find it, and analyze that case. Try "strace bash" :) I suspect it won't work - and for the reasons I described. [snip] >> (Now looking forward to discussing more details with the dmtcp team on >> Tuesday and on :) > > Also a very good point above, and I agree. The offline discussion should > be a better forum for putting this all into perspective. > > Thanks again for your thoughtful response, Same here. Talk to you soon... Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-07 21:59 ` Oren Laadan @ 2010-11-17 11:57 ` Tejun Heo 2010-11-17 15:39 ` Serge E. Hallyn 2010-11-17 22:17 ` Matt Helsley 0 siblings, 2 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-17 11:57 UTC (permalink / raw) To: Oren Laadan Cc: Gene Cooperman, Matt Helsley, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch, Linux Containers Hello, Oren. On 11/07/2010 10:59 PM, Oren Laadan wrote: > We could work to add ABIs and APIs for each and every possible piece > of state that affects userspace. And for each we'll argue forever > about the design and some time later regret that it wasn't designed > correctly :p I'm sorry but in-kernel CR already looks like a major misdesign to me. > Even if that happens (which is very unlikely and unnecessary), > it will generate all the very same code in the kernel that Tejun > has been complaining about, and _more_. And we will still suffer > from issues such as lack of atomicity and being unable to do many > simple and advanced optimizations. It may be harder but those will be localized for specific features which would be useful for other purposes too. With in-kernel CR, you're adding a bunch of intrusive changes which can't be tested or used apart from CR. > Or we could use linux-cr for that: do the c/r in the kernel, > keep the know-how in the kernel, expose (and commit to) a > per-kernel-version ABI (not vow to keep countless new individual > ABIs forever after getting them wrongly...), be able to do all > sorts of useful optimization and provide atomicity and guarantees > (see under "leak detection" in the OLS linux-cr paper). Also, > once the c/r infrastructure is in the kernel, it will be easy > (and encouraged) to support newly introduced features.
And the only reason it seems easier is that you're working around the ABI problem by declaring that these binary blobs wouldn't be kept compatible between different kernel versions and configurations. That simply is the wrong approach. If you want to export something, build it properly into the ABI. > Finally, then we would use dmtcp as well as other tools on top > of the kernel-cr - and I'm looking forward to doing that ! Yeah, I agree with this part. The higher level workarounds implemented in dmtcp are quite impressive and useful no matter what happens to the lower layer. Thanks. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 11:57 ` Tejun Heo @ 2010-11-17 15:39 ` Serge E. Hallyn 2010-11-17 15:46 ` Tejun Heo 2010-11-17 22:17 ` Matt Helsley 1 sibling, 1 reply; 111+ messages in thread From: Serge E. Hallyn @ 2010-11-17 15:39 UTC (permalink / raw) To: Tejun Heo Cc: Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel, Matt Helsley, Linux Containers, Eric W. Biederman, xemul Quoting Tejun Heo (tj@kernel.org): > Hello, Oren. > > On 11/07/2010 10:59 PM, Oren Laadan wrote: > > We could work to add ABIs and APIs for each and every possible piece > > of state that affects userspace. And for each we'll argue forever > > about the design and some time later regret that it wasn't designed > > correctly :p > > I'm sorry but in-kernel CR already looks like a major misdesign to me. By this do you mean the very idea of having CR support in the kernel? Or our design of it in the kernel? Let's go back to July 2008, at the containers mini-summit, where it was unanimously agreed upon that the kernel was the right place (Checkpoint/Restart [CR] under http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that we would start by supporting a single task with no resources. Was that whole discussion effectively misguided, in your opinion? Or do you feel that since the first steps outlined in that discussion we've either "gone too far" or strayed in the subsequent design? -serge ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 15:39 ` Serge E. Hallyn @ 2010-11-17 15:46 ` Tejun Heo 2010-11-18  9:13 ` Pavel Emelyanov ` (2 more replies) 0 siblings, 3 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-17 15:46 UTC (permalink / raw) To: Serge E. Hallyn Cc: Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel, Matt Helsley, Linux Containers, Eric W. Biederman, xemul Hello, Serge. On 11/17/2010 04:39 PM, Serge E. Hallyn wrote: >> I'm sorry but in-kernel CR already looks like a major misdesign to me. > > By this do you mean the very idea of having CR support in the kernel? > Or our design of it in the kernel? The former, I'm afraid. > Let's go back to July 2008, at the containers mini-summit, where it > was unanimously agreed upon that the kernel was the right place > (Checkpoint/Restart [CR] under > http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that > we would start by supporting a single task with no resources. Was > that whole discussion effectively misguided, in your opinion? Or do > you feel that since the first steps outlined in that discussion > we've either "gone too far" or strayed in the subsequent design? The conclusion doesn't seem like such a good idea, well, at least to me for what it's worth. Conclusions at summits don't carry decisive weight. It'll still have to prove its worthiness for mainline all the same and in light of already working userland alternative and the expanded area now covered by virtualization, the arguments in this thread don't seem too strong. Thanks. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 15:46 ` Tejun Heo @ 2010-11-18  9:13 ` Pavel Emelyanov 2010-11-18  9:48 ` Tejun Heo 2010-11-18 19:53 ` Oren Laadan 2010-11-19  4:10 ` Serge Hallyn 2 siblings, 1 reply; 111+ messages in thread From: Pavel Emelyanov @ 2010-11-18 9:13 UTC (permalink / raw) To: Tejun Heo Cc: Serge E. Hallyn, Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Matt Helsley, Linux Containers, Eric W. Biederman On 11/17/2010 06:46 PM, Tejun Heo wrote: > Hello, Serge. > > On 11/17/2010 04:39 PM, Serge E. Hallyn wrote: >>> I'm sorry but in-kernel CR already looks like a major misdesign to me. >> >> By this do you mean the very idea of having CR support in the kernel? >> Or our design of it in the kernel? > > The former, I'm afraid. Can you elaborate on this please? >> Let's go back to July 2008, at the containers mini-summit, where it >> was unanimously agreed upon that the kernel was the right place >> (Checkpoint/Restart [CR] under >> http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that >> we would start by supporting a single task with no resources. Was >> that whole discussion effectively misguided, in your opinion? Or do >> you feel that since the first steps outlined in that discussion >> we've either "gone too far" or strayed in the subsequent design? > > The conclusion doesn't seem like such a good idea, well, at least to > me for what it's worth. Conclusions at summits don't carry decisive > weight. It'll still have to prove its worthiness for mainline all the > same and in light of already working userland alternative and the > expanded area now covered by virtualization, the arguments in this > thread don't seem too strong. > > Thanks. > ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-18 9:13 ` Pavel Emelyanov @ 2010-11-18 9:48 ` Tejun Heo 2010-11-18 20:13 ` Jose R. Santos 2010-11-19 3:54 ` Serge Hallyn 0 siblings, 2 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-18 9:48 UTC (permalink / raw) To: Pavel Emelyanov Cc: Serge E. Hallyn, Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Matt Helsley, Linux Containers, Eric W. Biederman Hello, Pavel. On 11/18/2010 10:13 AM, Pavel Emelyanov wrote: >>> By this do you mean the very idea of having CR support in the kernel? >>> Or our design of it in the kernel? >> >> The former, I'm afraid. > > Can you elaborate on this please? I think I already did that several times in this thread but here's an attempt at summary. * It adds a bunch of pseudo ABI when most of the same information is available via already established ABI. * In a way which can only ever be used and tested by CR. If possible, kernel should provide generic mechanisms which can be used to implement features in userland. One of the reasons why we'd like to export small basic building blocks instead of full end-to-end solutions from the kernel is that we don't know how things will change in the future. In-kernel CR puts too much in the kernel in a way too inflexible manner. * It essentially adds a separate complete set of entry/exit points for a lot of things, which makes things more error prone and increases maintenance overhead across the board. * And, most of all, there are userland implementation and virtualization, making the benefit to overhead ratio completely off. Userland implementation _already_ achieves most of what's necessary for the most important use case of HPC without any special help from the kernel. The only reasonable thing to do is taking a good look at it and finding ways to improve it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-18  9:48 ` Tejun Heo @ 2010-11-18 20:13 ` Jose R. Santos 2010-11-19  3:54 ` Serge Hallyn 1 sibling, 0 replies; 111+ messages in thread From: Jose R. Santos @ 2010-11-18 20:13 UTC (permalink / raw) To: Tejun Heo Cc: Pavel Emelyanov, Serge E. Hallyn, Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Matt Helsley, Linux Containers, Eric W. Biederman On Thu, 18 Nov 2010 10:48:34 +0100 Tejun Heo <tj@kernel.org> wrote: > Hello, Pavel. > > On 11/18/2010 10:13 AM, Pavel Emelyanov wrote: > >>> By this do you mean the very idea of having CR support in the > >>> kernel? Or our design of it in the kernel? > >> > >> The former, I'm afraid. > > > > Can you elaborate on this please? > > I think I already did that several times in this thread but here's an > attempt at summary. Yet the arguments seem to be vague enough not to be convincing to the people working on the code. > * It adds a bunch of pseudo ABI when most of the same information is > available via already established ABI. Can you elaborate on this? What established ABI are you proposing we use here? Hopefully we can turn this into a more technical discussion. > * In a way which can only ever be used and tested by CR. If possible, So what if it can only be tested with CR as long as we can make CR work on a variety of environments? Scalability changes for _really_ large SMP boxes can only be reliably tested by people with such equipment. We are not imposing any such restriction and this code can be tested on a very wide range of setups. > kernel should provide generic mechanisms which can be used to > implement features in userland. One of the reasons why we'd like to > export small basic building blocks instead of full end-to-end > solutions from the kernel is that we don't know how things will > change in the future. In-kernel CR puts too much in the kernel in a > way too inflexible manner.
> > * It essentially adds a separate complete set of entry/exit points for > a lot of things, which makes things more error prone and increases > maintenance overhead across the board. I partially agree with you here. There will be maintenance overhead every time you add code to the kernel that _may_ make changes in the future more complicated. This is true for _any_ code that is added to the core kernel. Now in my experience such maintenance burden is most disruptive when the code being added creates a lot of new state that needs to be tracked in multiple places unrelated to CR (in this case). Our argument is that the CR code is not creating new state that will cause painful future changes to the kernel. If you have specific examples that you are concerned with, great. Let's discuss those. Are we promising zero maintenance cost? No. But guess what, neither do most features that make it into the kernel. Now, if we change the argument around... What would be the maintenance cost of keeping this outside the kernel? I would argue that it is much higher and would use SystemTap as the first example that comes to mind. > * And, most of all, there are userland implementation and > virtualization, making the benefit to overhead ratio completely off. Can we keep virtualization out of this? Every time someone mentions virtualization as a solution, it makes me feel like these people just don't understand the problem we are trying to solve. It is just not practical to create a new VM for every application you want to CR. These are two different tools to attack two different problems. > Userland implementation _already_ achieves most of what's necessary > for the most important use case of HPC without any special help from What are these _most_ important cases of HPC that you are referring to? Can we do a lot of these cases from userspace? Sure, but why are the ones that can't be done from userspace any less important? If nobody cared about those, we would not be having this conversation.
> the kernel. The only reasonable thing to do is taking a good look > at it and finding ways to improve it. The userspace vs in-kernel discussion has been done before as multiple people have already said in this thread. Show me a version of userspace CR that can correctly do all that an in-kernel implementation is capable of. > Thanks. > -- Jose R. Santos ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-18 9:48 ` Tejun Heo 2010-11-18 20:13 ` Jose R. Santos @ 2010-11-19 3:54 ` Serge Hallyn 1 sibling, 0 replies; 111+ messages in thread From: Serge Hallyn @ 2010-11-19 3:54 UTC (permalink / raw) To: Tejun Heo Cc: Pavel Emelyanov, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Eric W. Biederman, Linux Containers Quoting Tejun Heo (tj@kernel.org): > * And, most of all, there are userland implementation and > virtualization, making the benefit to overhead ratio completely off. > Userland implementation _already_ achieves most of what's necessary Guess I'll just be offensive here and say, straight-out: I don't believe it. Can I see the userspace implementation of c/r? If it's as good as the kernel level c/r, then awesome - we don't need the kernel patches. If it's not as good, then the thing is, we're not drawing arbitrary lines saying "is this good enough", rather we want completely reliable and transparent c/r. IOW, the running task and the other end can't tell that a migration happened, and, if checkpoint says it worked, then restart must succeed. -serge ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 15:46 ` Tejun Heo 2010-11-18 9:13 ` Pavel Emelyanov @ 2010-11-18 19:53 ` Oren Laadan 2010-11-19 4:10 ` Serge Hallyn 2 siblings, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-18 19:53 UTC (permalink / raw) To: Tejun Heo Cc: Serge E. Hallyn, Kapil Arya, Gene Cooperman, linux-kernel, Matt Helsley, Linux Containers, Eric W. Biederman, xemul On 11/17/2010 10:46 AM, Tejun Heo wrote: > Hello, Serge. > > On 11/17/2010 04:39 PM, Serge E. Hallyn wrote: >>> I'm sorry but in-kernel CR already looks like a major misdesign to me. >> >> By this do you mean the very idea of having CR support in the kernel? >> Or our design of it in the kernel? > > The former, I'm afraid. > >> Let's go back to July 2008, at the containers mini-summit, where it >> was unanimously agreed upon that the kernel was the right place >> (Checkpoint/Resetart [CR] under >> http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that >> we would start by supporting a single task with no resources. Was >> that whole discussion effectively misguided, in your opinion? Or do >> you feel that since the first steps outlined in that discussion >> we've either "gone too far" or strayed in the subsequent design? > > The conclusion doesn't seem like such a good idea, well, at least to > me for what it's worth. Conclusions at summits don't carry decisive > weight. It'll still have to prove its worthiness for mainline all the > same and in light of already working userland alternative and the > expanded area now covered by virtualization, the arguments in this > thread don't seem too strong. While it's your opinion that userland alternatives "already work", in reality they are unsuitable for several real use-cases. The userland approach has serious restrictions - which I will cover in a follow-up post to my discussion with Gene soon. 
Note that one important point of agreement was that DMTCP's ability to provide "glue" to restart applications without their original context is _orthogonal_ to how the core c/r is done. IOW - the exciting goodies from DMTCP are useful with either form of c/r. You also argue that "virtualization" (VMs?) covers everything else, implying that lightweight virtualization is useless. In reality it is an important technology, already in the kernel (surely you don't suggest to pull it out ?!) and for a reason. That is already a very good reason to provide, e.g. containers c/r and live-migration to keep it competitive and useful. Thanks, Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 15:46 ` Tejun Heo 2010-11-18 9:13 ` Pavel Emelyanov 2010-11-18 19:53 ` Oren Laadan @ 2010-11-19 4:10 ` Serge Hallyn 2010-11-19 14:04 ` Tejun Heo 2 siblings, 1 reply; 111+ messages in thread From: Serge Hallyn @ 2010-11-19 4:10 UTC (permalink / raw) To: Tejun Heo Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul, Linux Containers, Eric W. Biederman Quoting Tejun Heo (tj@kernel.org): > Hello, Serge. Hey Tejun :) > On 11/17/2010 04:39 PM, Serge E. Hallyn wrote: > >> I'm sorry but in-kernel CR already looks like a major misdesign to me. > > > > By this do you mean the very idea of having CR support in the kernel? > > Or our design of it in the kernel? > > The former, I'm afraid. > > > Let's go back to July 2008, at the containers mini-summit, where it > > was unanimously agreed upon that the kernel was the right place > > (Checkpoint/Resetart [CR] under > > http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that > > we would start by supporting a single task with no resources. Was > > that whole discussion effectively misguided, in your opinion? Or do > > you feel that since the first steps outlined in that discussion > > we've either "gone too far" or strayed in the subsequent design? > > The conclusion doesn't seem like such a good idea, well, at least to > me for what it's worth. Conclusions at summits don't carry decisive > weight. Of course. It allows us to present at kernel summit and look for early rejections to save us all some time (which we did, at the container mini-summit readout at ksummit 2008), but it would be silly to read anything more into it than that. > It'll still have to prove its worthiness for mainline all the > same 100% agreed. > and in light of already working userland alternative and the Here's where we disagree. 
If you are right about a viable userland alternative ('already working' isn't even a prereq in my opinion, so long as it is really viable), then I'm with you, but I'm not buying it at this point. Seriously. Truly. Honestly. I am *not* looking for any extra kernel work at this moment, if we can help it in any way. > expanded area now covered by virtualization, the arguments in this > thread don't seem too strong. -serge ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 4:10 ` Serge Hallyn @ 2010-11-19 14:04 ` Tejun Heo 2010-11-19 14:36 ` Kirill Korotaev ` (3 more replies) 0 siblings, 4 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-19 14:04 UTC (permalink / raw) To: Serge Hallyn Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul, Linux Containers, Eric W. Biederman On 11/19/2010 05:10 AM, Serge Hallyn wrote: > Hey Tejun :) Hey, :-) >> and in light of already working userland alternative and the > > Here's where we disagree. If you are right about a viable userland > alternative ('already working' isn't even a prereq in my opinion, > so long as it is really viable), then I'm with you, but I'm not buying > it at this point. > > Seriously. Truly. Honestly. I am *not* looking for any extra kernel > work at this moment, if we can help it in any way. What's so wrong with Gene's work? Sure, it has some hacky aspects but let's fix those up. To me, it sure looks like a much saner and more manageable approach than in-kernel CR. We can add nested ptrace, CLONE_SET_PID (or whatever) in pidns, integrate it with various ns supports, add an ability to adjust brk, export inotify state via fdinfo and so on. The thing is already working, the codebase of core part is fairly small and condor is contemplating integrating it, so at least some people in the HPC segment think it's already viable. Maybe the HPC cluster I'm currently sitting near is a special case but people here really don't run very fancy stuff. In most cases, they're fairly simple (from system POV) C programs reading/writing data and burning a _LOT_ of CPU cycles in between and admins here seem to think dmtcp integrated with condor would work well enough for them. Sure, in-kernel CR has better or more reliable coverage now but by how much? The basic things are already there in userland. The tradeoff simply doesn't make any sense. 
If it were a well-separated, self-sustained feature, it probably would be able to get in, but it's all over the place and requires a completely new concept - the quasi-ABI'ish binary blob which would probably be portable across different kernel versions with some massaging. I personally think the idea is fundamentally flawed (just go through the usual ABI!) but even if it were not it would require _MUCH_ stronger rationale than it currently has to be even considered for mainline inclusion. Maybe it's just me but most of the arguments for in-kernel CR look very weak. They're either about remote toy use cases or along the line that userland CR currently doesn't do everything kernel CR does (yet). Even if it weren't for me, I frankly can't see how it would be included in mainline. I think it would be best for everyone to improve userland CR. A lot of knowledge and experience gained through kernel CR would be applicable and won't go to waste. Strong resistance against direction change certainly is understandable but IMHO pushing the current direction would only increase loss. I of course could be completely wrong and might end up getting mails filled up with megabytes of "told you so" later, but, well, at this point, in-kernel CR already looks half dead to me. Thank you. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 14:04 ` Tejun Heo @ 2010-11-19 14:36 ` Kirill Korotaev 2010-11-19 15:33 ` Tejun Heo 2010-11-20 18:05 ` Oren Laadan ` (2 subsequent siblings) 3 siblings, 1 reply; 111+ messages in thread From: Kirill Korotaev @ 2010-11-19 14:36 UTC (permalink / raw) To: Tejun Heo Cc: Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers Tejun, Sorry for getting into the middle of the discussion, but... Can you imagine how many userland APIs are needed to make userspace C/R? Do you really want APIs in user-space which allow one to: - send signals with siginfo attached (kill() doesn't work...) - read inotify configuration - insert SKB's into socket buffers - setup all TCP/IP parameters for sockets - wait for AIO pending in other processes - setting different statistics counters (like netdev stats etc.) and so on... For every small piece of functionality you will need to export ABI and maintain it forever. It's thousands of APIs! And why the hell are they needed in user space at all? BTW, HPC case you are talking about is probably the simplest one. Last time I looked into it, IBM Meiosis c/r didn't even bother with tty's migration. In OpenVZ we really do need much more than that like autofs/NFS support, preserve statistics, TTYs, etc. etc. etc. Thanks, Kirill ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 14:36 ` Kirill Korotaev @ 2010-11-19 15:33 ` Tejun Heo 2010-11-19 16:00 ` Alexey Dobriyan 2010-11-20 17:58 ` Oren Laadan 0 siblings, 2 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-19 15:33 UTC (permalink / raw) To: Kirill Korotaev Cc: Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers Hello, On 11/19/2010 03:36 PM, Kirill Korotaev wrote: > Can you imagine how many userland APIs are needed to make userspace C/R? > > Do you really want APIs in user-space which allow to: > - send signals with siginfo attached (kill() doesn't work...) Doesn't rt_sigqueueinfo() already do this? > - read inotify configuration This would be nice even apart from CR. > - insert SKB's into socket buffers Can't we drain kernel buffers? ie. Stop further writing and wait the send-q to drop to zero. > - setup all TCP/IP parameters for sockets I _think_ most can be restored by talking to netfilter module. Setting outgoing sequence number might be beneficial tho. > - wait for AIO pending in other processes I haven't looked at aio implementation for a while now but can't we drain these upon checkpointing and just carry the completion status? Also, if aio is what you're concerned about, I would say the problem is mostly solved. > - setting different statistics counters (like netdev stats etc.) > and so on... Why would this matter? > For every small piece of functionality you will need to export ABI > and maintain it forever. It's thousands of APIs! And why the hell > they are needed in user space at all? I think it's actually quite the contrary. Most things are already visible to userland. They _have_ to be and that's the reason why userland implementation can already get most things working without any change to the kernel with some amount of hackery. 
To me in-kernel CR seems to approach the problem from the exactly wrong direction - rather than dealing with specific exceptions, it creates a completely new framework which is very foreign and not useful outside of CR. Also, think about it. Which one is better? A kernel which can fully show its ABI visible states to userland or one which dumps its internal data structures in binary blobs. To me, the latter seems multiple orders of magnitude uglier. > BTW, HPC case you are talking about is probably the simplest > one. Yet, it is one of the most important / relevant use cases. > Last time I looked into it, IBM Meiosis c/r didn't even bother with > tty's migration. In OpenVZ we really do need much more than that > like autofs/NFS support, preserve statistics, TTYs, etc. etc. etc. Would it be impossible to preserve autofs/NFS and TTYs from userland? Then, why so? For statistics, I'm a bit lost. Why does it matter and even if it does would it justify putting the whole CR inside kernel? Thank you. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
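For reference on the siginfo point in the message above: on Linux, sigqueue(3) is implemented on top of rt_sigqueueinfo(2), so userland can already queue a signal that carries a payload. What follows is a minimal sketch and not code from this thread. The helper name queue_and_collect is hypothetical, and the sketch only covers a user-generated signal (si_code == SI_QUEUE) sent to the calling process. It does not show how a checkpointer would reinstate arbitrary kernel-generated siginfo in another task.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <unistd.h>

/*
 * Hypothetical helper: queue SIGUSR1 to the calling process with an
 * integer payload, then collect it synchronously with sigwaitinfo().
 * Returns the payload found in siginfo, or -1 on any error (so -1 is
 * not a usable payload in this sketch).
 */
int queue_and_collect(int payload)
{
    sigset_t set, old;
    siginfo_t info;
    union sigval value;
    int result = -1;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);

    /* Block the signal so it stays pending until we ask for it. */
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    value.sival_int = payload;
    if (sigqueue(getpid(), SIGUSR1, value) == 0 &&
        sigwaitinfo(&set, &info) == SIGUSR1 &&
        info.si_code == SI_QUEUE) {
        /* The payload travels in the siginfo, unlike with kill(). */
        result = info.si_value.sival_int;
    }

    /* Restore the caller's signal mask. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```

Calling queue_and_collect(42) should hand back the same integer that was attached to the signal, which is the part kill() cannot do.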
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 15:33 ` Tejun Heo @ 2010-11-19 16:00 ` Alexey Dobriyan 2010-11-19 16:01 ` Alexey Dobriyan 2010-11-19 16:06 ` Tejun Heo 2010-11-20 17:58 ` Oren Laadan 1 sibling, 2 replies; 111+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:00 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On Fri, Nov 19, 2010 at 5:33 PM, Tejun Heo <tj@kernel.org> wrote: >> - insert SKB's into socket buffers > > Can't we drain kernel buffers? ie. Stop further writing and wait the > send-q to drop to zero. On send: if network dies right after freeze, you lose. On receive: packets arrive after process freeze, but before network device freeze. >> - setting different statistics counters (like netdev stats etc.) >> and so on... > > Why would this matter? Because you'll introduce million stupid interfaces not interesting to anyone but C/R. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:00 ` Alexey Dobriyan @ 2010-11-19 16:01 ` Alexey Dobriyan 2010-11-19 16:10 ` Tejun Heo 2010-11-19 16:06 ` Tejun Heo 1 sibling, 1 reply; 111+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:01 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On Fri, Nov 19, 2010 at 6:00 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote: >>> - setting different statistics counters (like netdev stats etc.) >>> and so on... >> >> Why would this matter? > > Because you'll introduce million stupid interfaces not interesting to > anyone but C/R. Just like CLONE_SET_PID. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:01 ` Alexey Dobriyan @ 2010-11-19 16:10 ` Tejun Heo 2010-11-19 16:25 ` Alexey Dobriyan 0 siblings, 1 reply; 111+ messages in thread From: Tejun Heo @ 2010-11-19 16:10 UTC (permalink / raw) To: Alexey Dobriyan Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On 11/19/2010 05:01 PM, Alexey Dobriyan wrote: > On Fri, Nov 19, 2010 at 6:00 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote: >>>> - setting different statistics counters (like netdev stats etc.) >>>> and so on... >>> >>> Why would this matter? >> >> Because you'll introduce million stupid interfaces not interesting to >> anyone but C/R. > > Just like CLONE_SET_PID. Well, if you ask me, having pidns w/o a way to reinstate PID from userland is pretty silly and you and I might not know yet but it's quite imaginable that there will be other use cases for the capability unlike in-kernel CR. Kernel provides building blocks not the whole frigging package and for very good reasons. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:10 ` Tejun Heo @ 2010-11-19 16:25 ` Alexey Dobriyan 0 siblings, 0 replies; 111+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:25 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On Fri, Nov 19, 2010 at 6:10 PM, Tejun Heo <tj@kernel.org> wrote: > Well, if you ask me, having pidns w/o a way to reinstate PID from > userland is pretty silly No. Chrome uses CLONE_NEWPID so that an exploit couldn't attach to processes in parent pidns. > and you and I might not know yet but it's > quite imaginable that there will be other use cases for the capability > unlike in-kernel CR. Kernel provides building blocks not the whole > frigging package and for very good reasons. Speaking of pids, pid's value itself is never interesting (except maybe pid 1). It's a cookie. CLONE_SET_PID came up only now because only C/R wants it. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:00 ` Alexey Dobriyan 2010-11-19 16:01 ` Alexey Dobriyan @ 2010-11-19 16:06 ` Tejun Heo 2010-11-19 16:16 ` Alexey Dobriyan 1 sibling, 1 reply; 111+ messages in thread From: Tejun Heo @ 2010-11-19 16:06 UTC (permalink / raw) To: Alexey Dobriyan Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers Hello, On 11/19/2010 05:00 PM, Alexey Dobriyan wrote: > On Fri, Nov 19, 2010 at 5:33 PM, Tejun Heo <tj@kernel.org> wrote: >>> - insert SKB's into socket buffers >> >> Can't we drain kernel buffers? ie. Stop further writing and wait the >> send-q to drop to zero. > > On send: > if network dies right after freeze, you lose. Gosh, if you're really worried about that, put a netfilter module which would buffer and simulate acks to extract the packets before initiating freeze. These are fringe problems. Use fringe solutions. > On receive: > packets arrive after process freeze, but before network device freeze. Just store the data somewhere. The checkpointer can drain the socket, right? >>> - setting different statistics counters (like netdev stats etc.) >>> and so on... >> >> Why would this matter? > > Because you'll introduce million stupid interfaces not interesting to > anyone but C/R. In this thread, how many have you guys come up with? Not even a dozen and most can be solved almost trivially. Seriously, what the hell.. Thanks. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
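The "drain the socket" suggestion in the message above can be sketched as a non-blocking read loop. This is an illustration under stated assumptions, not code from the thread: the names drain_socket and drain_demo are hypothetical, the demo uses an AF_UNIX socketpair inside a single process, and it assumes the checkpointer already holds the descriptor. It does not address the objection raised in the quoted text, namely data that arrives after the drain but before the network device is frozen.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read everything currently queued on fd without
 * blocking. Returns the number of bytes stored in buf, or -1 on error.
 */
ssize_t drain_socket(int fd, char *buf, size_t cap)
{
    size_t total = 0;

    while (total < cap) {
        ssize_t n = recv(fd, buf + total, cap - total, MSG_DONTWAIT);
        if (n > 0) {
            total += (size_t)n;
            continue;
        }
        if (n == 0)
            break;                      /* peer closed the connection */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                      /* receive queue is empty */
        if (errno == EINTR)
            continue;                   /* interrupted, try again */
        return -1;
    }
    return (ssize_t)total;
}

/* Queue bytes on one end of a socketpair and drain them from the other.
 * Returns 0 if the drained data matches what was sent, -1 otherwise. */
int drain_demo(void)
{
    static const char msg[] = "in-flight";
    const size_t len = sizeof(msg) - 1;
    char saved[64];
    int sv[2];
    int ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    ok = send(sv[0], msg, len, 0) == (ssize_t)len;
    /* First drain picks up the queued bytes to be stored in the image. */
    ok = ok && drain_socket(sv[1], saved, sizeof(saved)) == (ssize_t)len;
    ok = ok && memcmp(saved, msg, len) == 0;
    /* Second drain finds the queue empty. */
    ok = ok && drain_socket(sv[1], saved, sizeof(saved)) == 0;

    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

The saved bytes would then have to be replayed to the restarted process in some way, which is the part the thread disagrees about.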
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:06 ` Tejun Heo @ 2010-11-19 16:16 ` Alexey Dobriyan 2010-11-19 16:19 ` Tejun Heo 0 siblings, 1 reply; 111+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:16 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On Fri, Nov 19, 2010 at 6:06 PM, Tejun Heo <tj@kernel.org> wrote: >>>> - setting different statistics counters (like netdev stats etc.) >>>> and so on... >>> >>> Why would this matter? >> >> Because you'll introduce million stupid interfaces not interesting to >> anyone but C/R. > > In this thread, how many have you guys come up with? Not even a dozen > and most can be solved almost trivially. Seriously, what the hell.. I do not count them. The paragon of absurdity is struct task_struct::did_exec . ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:16 ` Alexey Dobriyan @ 2010-11-19 16:19 ` Tejun Heo 2010-11-19 16:27 ` Alexey Dobriyan 0 siblings, 1 reply; 111+ messages in thread From: Tejun Heo @ 2010-11-19 16:19 UTC (permalink / raw) To: Alexey Dobriyan Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On 11/19/2010 05:16 PM, Alexey Dobriyan wrote: > On Fri, Nov 19, 2010 at 6:06 PM, Tejun Heo <tj@kernel.org> wrote: >>>>> - setting different statistics counters (like netdev stats etc.) >>>>> and so on... >>>> >>>> Why would this matter? >>> >>> Because you'll introduce million stupid interfaces not interesting to >>> anyone but C/R. >> >> In this thread, how many have you guys come up with? Not even a dozen >> and most can be solved almost trivially. Seriously, what the hell.. > > I do not count them. > > The paragon of absurdity is struct task_struct::did_exec . Yeah, then go and figure how to do that in a way which would be useful for other purposes too instead of trying to shove the whole checkpointer inside the kernel. It sure would be harder but hey that's the way it is. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:19 ` Tejun Heo @ 2010-11-19 16:27 ` Alexey Dobriyan 2010-11-19 16:32 ` Tejun Heo 0 siblings, 1 reply; 111+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:27 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote: >> The paragon of absurdity is struct task_struct::did_exec . > > Yeah, then go and figure how to do that in a way which would be useful > for other purposes too instead of trying to shove the whole > checkpointer inside the kernel. It sure would be harder but hey > that's the way it is. System call for one bit? This is ridiculous. Doing execve(2) for userspace C/R is ridiculous too (and likely doesn't work). ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:27 ` Alexey Dobriyan @ 2010-11-19 16:32 ` Tejun Heo 2010-11-19 16:38 ` Alexey Dobriyan 0 siblings, 1 reply; 111+ messages in thread From: Tejun Heo @ 2010-11-19 16:32 UTC (permalink / raw) To: Alexey Dobriyan Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On 11/19/2010 05:27 PM, Alexey Dobriyan wrote: > On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote: >>> The paragon of absurdity is struct task_struct::did_exec . >> >> Yeah, then go and figure how to do that in a way which would be useful >> for other purposes too instead of trying to shove the whole >> checkpointer inside the kernel. It sure would be harder but hey >> that's the way it is. > > System call for one bit? This is ridiculous. Why not just a flag in proc entry? It's a frigging single bit. > Doing execve(2) for userspace C/R is ridicoulous too (and likely > doesn't work). Really, whatever. Just keep doing what you're doing. Hey, if it makes you happy, it can't be too wrong. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:32 ` Tejun Heo @ 2010-11-19 16:38 ` Alexey Dobriyan 2010-11-19 16:50 ` Tejun Heo 0 siblings, 1 reply; 111+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:38 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On Fri, Nov 19, 2010 at 6:32 PM, Tejun Heo <tj@kernel.org> wrote: > On 11/19/2010 05:27 PM, Alexey Dobriyan wrote: >> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote: >>>> The paragon of absurdity is struct task_struct::did_exec . >>> >>> Yeah, then go and figure how to do that in a way which would be useful >>> for other purposes too instead of trying to shove the whole >>> checkpointer inside the kernel. It sure would be harder but hey >>> that's the way it is. >> >> System call for one bit? This is ridiculous. > > Why not just a flag in proc entry? It's a frigging single bit. Because /proc/*/did_exec useless to anyone but C/R (even for reading!). Because code is much simpler: tsk->did_exec = !!tsk_img->did_exec; + __u8 did_exec; ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:38 ` Alexey Dobriyan @ 2010-11-19 16:50 ` Tejun Heo 2010-11-19 16:55 ` Alexey Dobriyan 0 siblings, 1 reply; 111+ messages in thread From: Tejun Heo @ 2010-11-19 16:50 UTC (permalink / raw) To: Alexey Dobriyan Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On 11/19/2010 05:38 PM, Alexey Dobriyan wrote: > On Fri, Nov 19, 2010 at 6:32 PM, Tejun Heo <tj@kernel.org> wrote: >> On 11/19/2010 05:27 PM, Alexey Dobriyan wrote: >>> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote: >>>>> The paragon of absurdity is struct task_struct::did_exec . >>>> >>>> Yeah, then go and figure how to do that in a way which would be useful >>>> for other purposes too instead of trying to shove the whole >>>> checkpointer inside the kernel. It sure would be harder but hey >>>> that's the way it is. >>> >>> System call for one bit? This is ridiculous. >> >> Why not just a flag in proc entry? It's a frigging single bit. > > Because /proc/*/did_exec useless to anyone but C/R (even for reading!). I don't think you'll need a full file. Just shove it in status or somewhere. Your argument is completely absurd. So, because exporting single bit is so horrible to everyone else, you want to shove the whole frigging checkpointer inside the kernel? > Because code is much simpler: > > tsk->did_exec = !!tsk_img->did_exec; > + > __u8 did_exec; Sigh, yeah, except for the horror show to create tsk_img. Your "paragon of absurdity" is did_exec which is only ever used to decide whether setpgid() should fail with -EACCES, seriously? Here's a thought. Ignore it for now and concentrate on more relevant problems. I'm fairly sure CR'd program malfunctioning over did_exec wouldn't mark the beginning of the end of our civilization. You gotta be kidding me. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
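The userland-visible effect of did_exec mentioned in the message above is the POSIX rule that setpgid() on a child fails with EACCES once that child has called execve(). The sketch below is a hedged illustration, not code from the thread. The function name setpgid_errno_after_exec is hypothetical, it assumes a sleep binary can be found via PATH, and it uses a close-on-exec pipe so the parent only calls setpgid() after the exec has happened.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the errno that setpgid() reports for a
 * child which has already exec'ed, 0 if setpgid() unexpectedly
 * succeeded, or -1 if the setup (pipe, fork, exec) failed.
 */
int setpgid_errno_after_exec(void)
{
    int pfd[2];
    char byte;
    ssize_t n;
    pid_t child;
    int result;

    if (pipe(pfd) != 0)
        return -1;
    /* Write end closes on exec, so the parent sees EOF after execve(). */
    if (fcntl(pfd[1], F_SETFD, FD_CLOEXEC) != 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    child = fork();
    if (child < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }
    if (child == 0) {
        close(pfd[0]);
        execlp("sleep", "sleep", "30", (char *)NULL);
        /* Only reached if exec failed: report it and exit. */
        if (write(pfd[1], "x", 1) < 0) { /* nothing more we can do */ }
        _exit(127);
    }

    close(pfd[1]);
    do {
        n = read(pfd[0], &byte, 1);
    } while (n < 0 && errno == EINTR);
    close(pfd[0]);

    if (n != 0)
        result = -1;                 /* exec failed or read error */
    else if (setpgid(child, child) == 0)
        result = 0;                  /* not expected after exec */
    else
        result = errno;              /* the post-exec restriction */

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return result;
}
```

A restarted task that was recreated without an exec would not be subject to this restriction unless the flag is restored, which is the single bit being argued over here.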
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:50 ` Tejun Heo @ 2010-11-19 16:55 ` Alexey Dobriyan 0 siblings, 0 replies; 111+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:55 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On Fri, Nov 19, 2010 at 6:50 PM, Tejun Heo <tj@kernel.org> wrote: > On 11/19/2010 05:38 PM, Alexey Dobriyan wrote: >> On Fri, Nov 19, 2010 at 6:32 PM, Tejun Heo <tj@kernel.org> wrote: >>> On 11/19/2010 05:27 PM, Alexey Dobriyan wrote: >>>> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote: >>>>>> The paragon of absurdity is struct task_struct::did_exec . >>>>> >>>>> Yeah, then go and figure how to do that in a way which would be useful >>>>> for other purposes too instead of trying to shove the whole >>>>> checkpointer inside the kernel. It sure would be harder but hey >>>>> that's the way it is. >>>> >>>> System call for one bit? This is ridiculous. >>> >>> Why not just a flag in proc entry? It's a frigging single bit. >> >> Because /proc/*/did_exec useless to anyone but C/R (even for reading!). > > I don't think you'll need a full file. Just shove it in status or > somewhere. Your argument is completely absurd. So, because exporting > single bit is so horrible to everyone else, you want to shove the > whole frigging checkpointer inside the kernel? > >> Because code is much simpler: >> >> tsk->did_exec = !!tsk_img->did_exec; >> + >> __u8 did_exec; > > Sigh, yeah, except for the horror show to create tsk_img. task_struct image work is common for both userspace C/R and in-kernel. You _have_ to define it. Simpler code is only first line. > Your "paragon of absurdity" is did_exec which is only ever used > to decide whether setpgid() should fail with -EACCES, seriously? > Here's a thought. Ignore it for now and concentrate on more > relevant problems. 
You're so newjerseyly now. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 15:33 ` Tejun Heo 2010-11-19 16:00 ` Alexey Dobriyan @ 2010-11-20 17:58 ` Oren Laadan 1 sibling, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-20 17:58 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Kapil Arya, Pavel Emelianov, Gene Cooperman, linux-kernel@vger.kernel.org, Eric W. Biederman, Linux Containers

On Fri, 19 Nov 2010, Tejun Heo wrote:

> Hello,
>
> On 11/19/2010 03:36 PM, Kirill Korotaev wrote:
> > Can you imagine how many userland APIs are needed to make userspace C/R?
> >
> > Do you really want APIs in user-space which allow to:
> > - send signals with siginfo attached (kill() doesn't work...)
>
> Doesn't rt_sigqueueinfo() already do this?
>

You assume that c/r is done by the checkpointed processes _themselves_;
that is, to checkpoint a process, that process needs to be made
runnable so that it can save its own state (which is the model of
dmtcp, but not of using ptrace).

This model is restrictive: it requires that you hijack the execution
of that process somehow and make it run. What if the process isn't
runnable (e.g. in vfork waiting for completion, or ptraced deep in the
kernel)? Letting it run even just a bit may modify its state. It also
means that if you have many processes in the checkpointed session,
e.g. 1000, then _all_ of them will have to be scheduled to run!

With kernel c/r this is unnecessary: you can use an auxiliary process
to checkpoint other processes without scheduling the other processes.
I.e. it's _transparent_ and _preemptive_. Another advantage is that
if anything fails during checkpoint (for whatever reason), there are
no side-effects (which is not the case with the other method).

> > For every small piece of functionality you will need to export ABI
> > and maintain it forever. It's thousands of APIs! And why the hell
> > they are needed in user space at all?
>
> I think it's actually quite the contrary.
Most things are already
> visible to userland.  They _have_ to be and that's the reason why
> userland implementation can already get most things working without
> any change to the kernel with some amount of hackery.  To me in-kernel
> CR seems to approach the problem from the exactly wrong direction -
> rather than dealing with specific exceptions, it creates a completely
> new framework which is very foreign and not useful outside of CR.
>
> Also, think about it.  Which one is better?  A kernel which can fully
> show its ABI visible states to userland or one which dumps its
> internal data structures in binary blobs.  To me, the latter seems
> multiple orders of magnitude uglier.

Are we judging aesthetics? To me the former looks uglier...

The amount of fragile hacks you need to go through to make it work in
userspace for the generic cases (including userspace trickery and new
crazy APIs from the kernel for state that was never even an ABI, like
skb's), and the restrictions it poses, simply suggest that userspace
is not the right place to do it.

Thanks,

Oren.

^ permalink raw reply [flat|nested] 111+ messages in thread
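On the siginfo point raised above: glibc's sigqueue(), which Linux implements on top of rt_sigqueueinfo(), already lets one task queue a signal with a payload attached. The sketch below only demonstrates that round trip within one process. It does not show restoring an arbitrary saved siginfo, which would need the raw syscall and is subject to the kernel's restrictions on si_code. The function name queue_and_receive() is made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/*
 * Queue `sig` to ourselves with an integer payload, then collect it
 * synchronously.  Returns the payload found in the received siginfo,
 * or -1 on error or if nothing arrives within one second.
 */
int queue_and_receive(int sig, int payload)
{
    sigset_t set, old;
    siginfo_t info;
    union sigval val;
    struct timespec timeout = { .tv_sec = 1, .tv_nsec = 0 };
    int ret = -1;

    assert(sig > 0);

    /* Block the signal so it stays pending until we wait for it. */
    sigemptyset(&set);
    sigaddset(&set, sig);
    if (sigprocmask(SIG_BLOCK, &set, &old) < 0)
        return -1;

    val.sival_int = payload;
    if (sigqueue(getpid(), sig, val) == 0 &&
        sigtimedwait(&set, &info, &timeout) == sig &&
        info.si_code == SI_QUEUE)
        ret = info.si_value.sival_int;

    /* Restore the caller's signal mask. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return ret;
}
```

Note that the receiver sees si_code set to SI_QUEUE here, so a restart tool that must reproduce a pending signal exactly as the kernel originally generated it would still need something beyond this call.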
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 14:04 ` Tejun Heo 2010-11-19 14:36 ` Kirill Korotaev @ 2010-11-20 18:05 ` Oren Laadan 2010-11-20 18:08 ` Oren Laadan 2010-11-20 18:11 ` Oren Laadan 3 siblings, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-20 18:05 UTC (permalink / raw) To: Tejun Heo Cc: Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel, xemul, Eric W. Biederman, Linux Containers

Hi,

Based on discussion with Gene, I'd like to clarify key points and
differences between kernel and userspace approaches (specifically
linux-cr and dmtcp): three parts to break the long post...

part I:   perspective about the types of scopes of c/r in discussion
part II:  linux-cr design and objectives
part III: comparison of kernel/userspace approaches

[now relax, grab (another) cup of coffee and read on...]

PART I: ==PERSPECTIVE==

A rough classification of c/r categories:

* container-c/r: important use-case, e.g. c/r and migration of
  application containers like VPS (virtual private server), VDI
  (desktop) or other self-contained application (e.g. Oracle server).
  Here _all_ the relevant processes are included in the checkpoint.

* standalone-c/r: another use-case is standalone-c/r where a set of
  processes is checkpointed, but not the entire environment, and then
  those processes are restarted in a different "eco-system".

* distributed-c/r: meaning several sets of processes, each running on
  a different host. (Each set may be a separate container there).

In container-c/r, the main challenge is to be _reliable_ in the sense
that a restart from a successful checkpoint should always succeed.

In standalone-c/r, the main challenge is that an application resumes
execution after a restart in a possibly _different_ eco-system. Some
applications don't care (e.g. 'bc'). Other applications do care, and
to different degrees; for these we need "glue" to pacify the
application.

There are generally three types of "glue":

(1) Modify the application or selected libraries to be c/r-aware, and
    notify it when restart completes (e.g. CoCheck MPI library).

(2) Add a userspace helper that will run post-restart to do necessary
    trickery (e.g. send a SIGWINCH to 'screen'; mount proper filesystem
    at the new host after migration; reconnect a socket to a peer).

(3) Use interposition on selected library calls and add wrapper code
    that will glue in what's missing (e.g. dbus or nscd calls to
    reconnect an application to those services).

IMPORTANT: the glueing method is _orthogonal_ to how the c/r is done!
We are strictly discussing the core c/r functionality.

(next part: linux-cr philosophy...)

Thanks,

Oren.

^ permalink raw reply [flat|nested] 111+ messages in thread
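Glue of type (2) can be very small. As a rough sketch, assuming the restart tool knows the pid of the restored full-screen program, the SIGWINCH nudge is a single kill() call. The names nudge_terminal_app() and winch_roundtrip() are hypothetical, and the handler below merely stands in for the redraw logic a program like 'screen' would run.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Number of SIGWINCH signals handled so far. */
static volatile sig_atomic_t winch_seen;

/* Application side: a full-screen program would redraw here. */
static void on_winch(int sig)
{
    (void)sig;
    winch_seen++;
}

/* Glue side (hypothetical post-restart helper): nudge a restored task. */
int nudge_terminal_app(pid_t pid)
{
    assert(pid > 0);
    return kill(pid, SIGWINCH);
}

/*
 * Self-contained demonstration: install the handler, nudge ourselves,
 * and return how many SIGWINCHs were handled, or -1 on error.
 */
int winch_roundtrip(void)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_winch;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGWINCH, &sa, &old) < 0)
        return -1;

    winch_seen = 0;
    if (nudge_terminal_app(getpid()) < 0) {
        sigaction(SIGWINCH, &old, NULL);
        return -1;
    }

    /* Put the previous disposition back before reporting. */
    sigaction(SIGWINCH, &old, NULL);
    return winch_seen;
}
```

The helper is the same regardless of whether the process image was restored by an in-kernel or a userspace c/r-engine, which is the orthogonality the message stresses.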
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 14:04 ` Tejun Heo 2010-11-19 14:36 ` Kirill Korotaev 2010-11-20 18:05 ` Oren Laadan @ 2010-11-20 18:08 ` Oren Laadan 2010-11-20 18:11 ` Oren Laadan 3 siblings, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-20 18:08 UTC (permalink / raw) To: Tejun Heo Cc: Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel, xemul, Eric W. Biederman, Linux Containers

Hi,

[continuation of posting regarding kernel vs userspace approach]

part I:   perspective about the types of scopes of c/r in discussion
part II:  linux-cr design and objectives
part III: comparison of kernel/userspace approaches

PART II: ==PHILOSOPHY==

Linux-cr is a _generic_ c/r-engine with multiple capabilities. It can
checkpoint a full container, a process hierarchy, or a single process.
For containers, it provides guarantees like restart-ability; for the
others, it provides the flexibility so that c/r-aware applications,
libraries, helpers, and wrappers can glue what they wish to glue.

1) Transparent - completely transparent for container-c/r, and largely
   so for standalone-c/r ("largely" - as in except for the glue which
   is needed due to loss of eco-system, not due to restarting).

2) Reliable - if checkpoint succeeds then restart is guaranteed to
   succeed too (for container-c/r).

3) Preemptive - works without requiring that checkpointed processes be
   scheduled to run (and thus "collaborate").

4) Complete - covers all visible and hidden state in the kernel about
   processes (even if not directly visible to userspace).

5) Efficient - can be optimized along multiple axes: _zero_ impact on
   runtime, low downtime during checkpoint, partial and incremental
   checkpoint, live-migration, etc.

6) Flexible - can integrate nicely with different userspace "glueing"
   methods.

7) Maintainable - a small part of the code is to refactor kernel code
   so that it can be reused in restart; the rest is new code that in
   our experience rarely changes. The same holds for the image format.

What linux-cr _does not_ do in the kernel, nor plans to support is:

1) Hardware devices: their state is per-device/vendor. Instead one
   should use virtual devices (VNC for display, pulseaudio for sound,
   screen for ttys), or have a userspace glue to restore the state of
   the device. That said, in the future vendors may opt to provide
   logic for c/r in drivers, e.g. ->checkpoint, ->restart methods.

2) Userspace glue: (as defined for standalone-c/r above) the kernel
   knows about processes and their state, not about their intentions.
   We leave that for userspace.

3) External dependencies: (outside of the local host) the kernel does
   not control what's outside the host. That is the responsibility of
   userspace. (Even with live-migration, linux-cr only restores the
   local state of the TCP connections).

Oren.

^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 14:04 ` Tejun Heo ` (2 preceding siblings ...) 2010-11-20 18:08 ` Oren Laadan @ 2010-11-20 18:11 ` Oren Laadan 2010-11-20 18:15 ` Oren Laadan 3 siblings, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-20 18:11 UTC (permalink / raw) To: Tejun Heo Cc: Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel, xemul, Linux Containers, Eric W. Biederman

Hi,

[continuation of discussion of kernel vs userspace c/r approach]

part I:   perspective about the types of scopes of c/r in discussion
part II:  linux-cr design and objectives
part III: comparison of kernel/userspace approaches

PART III: ==SOME TECHNICAL ASPECTS==

Important to know about userspace (DMTCP example) before presenting a
comparison between kernel and userspace approaches:

DMTCP has two components: 1) c/r-engine to save/restore process state,
and 2) glue to restart processes out of their original context. They
are _orthogonal_: the glue can be used with other c/r-engines, like
linux-cr. This discussion refers to the c/r-engine _only_.
Focusing on the c/r-engine of DMTCP - it uses syscall interposition
for three reasons:

1) To take control of processes at checkpoint
2) To always track state of resources not visible to userspace
3) To virtualize identifiers after restart

#1 is needed because processes save their own state (and need to run
the checkpoint code for that). #2 is needed because the kernel does
not expose all state, and #3 is needed because the kernel does not
give ways to restore all state. So the latter two are used to mirror,
in userspace, functionality that already exists in the kernel.

The main advantages of the approach: (a) portability to other systems
(like BSD), though with considerable effort; (b) it's "good enough"
for several use-cases, without kernel changes.

Putting the c/r-engine in the kernel provides many advantages, which I
summarize in the following table:

category        linux-cr                        userspace
--------------------------------------------------------------------------------
PERFORMANCE     has _zero_ runtime overhead     visible overhead due to syscalls
                                                interposition and state tracking
                                                even w/o checkpoints;

OPTIMIZATIONS   many optimizations possible     limited, less effective
                only in kernel, for downtime,   w/ much larger overhead.
                image size, live-migration

OPERATION       applications run unmodified     to do c/r, needs 'controller'
                                                task (launch and manage _entire_
                                                execution) - point of failure.
                                                restricts how a system is used.

PREEMPTIVE      checkpoint at any time, use     processes must be runnable and
                auxiliary task to save state;   "collaborate" for checkpoint;
                non-intrusive: failure does     long task coordination time
                not impact checkpointees.       with many tasks/threads. alters
                                                state of checkpointee if fails.
                                                e.g. cannot checkpoint when in
                                                vfork(), ptrace states, etc.

COVERAGE        save/restore _all_ task state;  needs new ABI for everything:
                identify shared resources; can  expose state, provide means to
                extend for new kernel features  restore state (e.g. TCP protocol
                easily                          options negotiated with peers)

RELIABILITY     checkpoint w/ single syscall;   non-atomic, cannot find leaks
                atomic operation. guaranteed    to determine restartability
                restartability for containers

USERSPACE GLUE  possible                        possible

SECURITY        root and non-root modes         root and non-root modes
                native support for LSM

MAINTENANCE     changes mainly for features     changes mainly for features;
                                                create new ABI for features

I'm not saying Gene's work isn't good - on the contrary, it's a fine
piece of engineering. However, the part of it that does c/r poses many
constraints that limit the generality, mode of use, and performance of
the whole. That may be enough, Tejun, for your cluster. But not for
other users of the technology.

And by all means, I intend to cooperate with Gene to see how to make
the other part of DMTCP, namely the userspace "glue", work on top of
linux-cr to have the benefits of all worlds!

All in all, kernel c/r is far more generic and less restrictive than
userspace, can provide nice guarantees, and has superior performance.
It can do everything that a userspace c/r can do, and much more - and
that "much more" is crucial for important use cases.

Last word about maintenance - once the core code is in mainline (which
means a code "spike"), experience (both kernel/userspace) shows that
both code and image format hardly change. The format is tied to the
specific set of features supported (i.e. kernel versions) so that the
kernel does not need to maintain backward compatibility.

Thanks,

Oren

^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-20 18:11 ` Oren Laadan @ 2010-11-20 18:15 ` Oren Laadan 2010-11-20 19:33 ` Tejun Heo 0 siblings, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-20 18:15 UTC (permalink / raw) To: Tejun Heo Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul, Eric W. Biederman, Linux Containers [[apologies for the silly prefix on last two posts - a combination of windows, putty, pine and slow connection is not helping me :( ]] ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-20 18:15 ` Oren Laadan @ 2010-11-20 19:33 ` Tejun Heo 2010-11-21 8:18 ` Gene Cooperman 2010-11-22 17:18 ` Oren Laadan 0 siblings, 2 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-20 19:33 UTC (permalink / raw) To: Oren Laadan Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul, Eric W. Biederman, Linux Containers Hello, On 11/20/2010 07:15 PM, Oren Laadan wrote: > > [[apologies for the silly prefix on last two posts - a combination > of windows, putty, pine and slow connection is not helping me :( ]] Maybe it's a good idea to post a clean concatenated version for later reference? Thanks. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-20 19:33 ` Tejun Heo @ 2010-11-21 8:18 ` Gene Cooperman 2010-11-21 8:21 ` Gene Cooperman ` (2 more replies) 2010-11-22 17:18 ` Oren Laadan 1 sibling, 3 replies; 111+ messages in thread From: Gene Cooperman @ 2010-11-21 8:18 UTC (permalink / raw) To: Tejun Heo Cc: Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel, xemul, Eric W. Biederman, Linux Containers, Gene Cooperman In this post, Kapil and I will provide our own summary of how we see the issues for discussion so far. In the next post, we'll reply specifically to comment on Oren's table of comparison between linux-cr and userspace. In general, we'd like to add that the conversation with Oren was very useful for us, and I think Oren will also agree that we were able to converge on the purely technical questions. Concerning opinions, we want to be cautious on opinions, since we're still learning the context of this ongoing discussion on LKML. There is probably still some context that we're missing. Below, we'll summarize the four major questions that we've understood from this discussion so far. But before doing so, I want to point out that a single process or process tree will always have many possible interactions with the rest of the world. Within our own group, we have an internal slogan: "You can't checkpoint the world." A virtual machine can have a relatively closed world, which makes it more robust, but checkpointing will always have some fragile parts. We give four examples below: a. time virtualization b. external database c. NSCD daemon d. screen and other full-screen text programs These are not the only examples of difficult interactions with the rest of the world. Anyway, in my opinion, the conversation with Oren seemed to converge into two larger cases: 1. In a pure userland C/R like DMTCP, how many corner cases are not handled, or could not be handled, in a pure userland approach? Also, how important are those corner cases? 
Do some have important use cases that rise above just a corner case? [ inotify is one of those examples. For DMTCP to support this, it would have to put wrappers around inotify_add_watch, inotify_rm_watch, read, etc., and maybe even tracking inodes in case the file had been renamed after the inotify_add_watch. Something could be made to work for the common cases, but it would still be a hack --- to be done only if a use case demands it. ] 2. In a Linux C/R approach, it's already recognized that one needs a userland component (for example, for convenience of recreating the process tree on restart). How many other cases are there that require a userland component? [ One example here is the shared memory segment of NSCD, which has to be re-initialized on restart. Another example is a screen process that talks to an ANSI terminal emulator (e.g. gnome-terminal), which talks to an X server or VNC server. Below, we discuss these examples in more detail. ] One can add a third and fourth question here: 3. [Originally posed by Oren] Given Linux C/R, how much work would it be to add the higher layers of DMTCP on top of Linux C/R? [ This is a non-trivial question. As just one example, DMTCP handles sockets uniformly, regardless of whether they are intra-host or inter-host. Linux C/R handles certain types of intra-host sockets. So, merging the two would require some thought. ] 4. [Originally posed by Tejun, e.g. Fri Nov 19 2010 - 09:04:42 EST] Given that DMTCP checkpoints many common applications, how much work would it be to add a small number of restricted kernel interfaces to enable one to remove some of the hacks in DMTCP, and to cover the more important corner cases that DMTCP might be missing? I'd also like to add some points of my own here. First, there are certain cases where I believe that a checkpoint-restart system (in-kernel or userland or hybrid) can never be completely transparent. It's because you can't completely cut the connection with the rest of the world. 
In these examples, I'm thinking primarily of the Linux C/R mode used
to checkpoint a tree of processes. To the extent that Linux C/R is
used with containers, it seems to me to be closer to lightweight
virtualization. From there, I've seen that the conversation goes to
comparing lightweight virtualization versus traditional virtual
machines, but that discussion goes beyond my own personal expertise.

Here are some examples where I believe that every checkpointing system
would suffer from the syndrome of trying to "checkpoint the world".

1. Time virtualization --- Right now, neither system does time
   virtualization. Both systems could do it. But what is the right
   policy? For example, one process may set a deadline for a task an
   hour in the future, and then periodically poll the kernel for the
   current time to see if one hour has passed. This use case seems to
   require time virtualization. A second process wants to know the
   current day and time, because a certain web service updates its
   information at midnight each day. This use case seems to argue
   that time virtualization is bad.

2. External database file on another host --- It's not possible to
   checkpoint the remote database file. In our work with the Condor
   developers, they asked us to add a "Condor mode", which says that
   if there are any external socket connections, then delay the
   checkpoint until the external socket connections are closed. In a
   different joint project with CERN (Geneva), we considered a
   checkpointing application in which an application saves much of the
   database, and then on restart, discovers how much of its data is
   stale, and re-loads only the stale portion.

3. NSCD (Name Service Cache Daemon) --- Glibc arranges for certain
   information to be cached in the NSCD. The information is in a
   memory segment shared between the NSCD and the application. Upon
   restart, the application doesn't know that the memory segment is no
   longer shared with the NSCD, or that the information is stale.
   The DMTCP "hack" is to zero out this memory page on restart. Then
   glibc recognizes that it needs to create a new shared memory
   segment.

4. screen --- The screen application sets the scrolling region of its
   ANSI terminal emulator, in order to create a status line at the
   bottom, while scrolling the remaining lines of the terminal. Upon
   restart, screen assumes that the scrolling region has already been
   set up, and doesn't have to be re-initialized. So, on restart,
   DMTCP uses SIGWINCH to fool screen (or any full-screen text-based
   application) into believing that its window size has been changed.
   So, screen (or vim, or emacs) then re-initializes the state of its
   ANSI terminal, including scrolling regions and so on.

So, a userland component is helpful in doing the kind of hacks above.
I recognize that the Linux C/R team agrees that some userland
component can be useful. I just want to show why some userland hacks
will always be needed.

Let's consider a pure in-kernel approach to checkpointing 'screen' (or
almost any full-screen application that uses a status bar at the
bottom). Screen sets the scrolling region of an ANSI terminal
emulator, which might be a gnome-terminal. So, a pure in-kernel
approach needs to also checkpoint the gnome-terminal. But the
gnome-terminal needs to talk to an X server. So, now one also needs to
start up inside a VNC server to emulate the X server. So, either one
adds a "hack" in userland to force screen to re-initialize its ANSI
terminal emulator, or else one is forced to include an entire VNC
server just to checkpoint a screen process.

Finally, this excerpt below from Tejun's post sums up our views too.
We don't have the kernel expertise of the people on this list, but
we've had to do a little bit of reading the kernel code where the
documentation was sparse and in teaching O/S. We would certainly be
very happy to work closely with the kernel developers, if there was
interest in extending DMTCP to directly use more kernel support.
- Gene and Kapil Tejun Heo wrote Fri Nov 19 2010 - 09:04:42 EST > What's so wrong with Gene's work? Sure, it has some hacky aspects but > let's fix those up. To me, it sure looks like much saner and > manageable approach than in-kernel CR. We can add nested ptrace, > CLONE_SET_PID (or whatever) in pidns, integrate it with various ns > supports, add an ability to adjust brk, export inotify state via > fdinfo and so on. > > The thing is already working, the codebase of core part is fairly > small and condor is contemplating integrating it, so at least some > people in HPC segment think it's already viable. Maybe the HPC > cluster I'm currently sitting near is special case but people here > really don't run very fancy stuff. In most cases, they're fairly > simple (from system POV) C programs reading/writing data and burning a > _LOT_ of CPU cycles inbetween and admins here seem to think dmtcp > integrated with condor would work well enough for them. > > Sure, in-kernel CR has better or more reliable coverage now but by how > much? The basic things are already there in userland. ^ permalink raw reply [flat|nested] 111+ messages in thread
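The time-virtualization dilemma from the message above can be sketched in a few lines. This is purely illustrative: neither DMTCP nor linux-cr is stated to work this way, and every name below (cr_note_downtime(), cr_virtualize(), cr_deadline_passed(), cr_wall_clock()) is invented for the example. The idea is that a restart hook records how long the task was frozen, deadline-style queries subtract that downtime, and calendar-style queries do not.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <time.h>

/* Total seconds spent between checkpoint and restart (hypothetical). */
static time_t cr_downtime_sec;

/* A restart hook would call this with the length of the outage. */
void cr_note_downtime(time_t seconds)
{
    assert(seconds >= 0);
    cr_downtime_sec += seconds;
}

/* Map a real timestamp to the time the application "experienced". */
struct timespec cr_virtualize(struct timespec real)
{
    real.tv_sec -= cr_downtime_sec;
    return real;
}

/* Policy A: deadlines are checked against virtualized time. */
int cr_deadline_passed(struct timespec real_now, time_t deadline_sec)
{
    return cr_virtualize(real_now).tv_sec >= deadline_sec;
}

/* Policy B: calendar queries see the real wall clock, unmodified. */
int cr_wall_clock(struct timespec *ts)
{
    assert(ts != NULL);
    return clock_gettime(CLOCK_REALTIME, ts);
}
```

For example, a task that starts at t=1000 and sets a one-hour deadline (t=4600), runs for 600 seconds, and is then frozen for 7200 seconds resumes at real time 8800: without the recorded downtime the deadline appears long past, while with it the task sees t=1600 and keeps waiting. Which of the two answers is "right" depends on the application, which is the point being made.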
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-21 8:18 ` Gene Cooperman @ 2010-11-21 8:21 ` Gene Cooperman 2010-11-22 18:02 ` Sukadev Bhattiprolu 2010-11-23 17:53 ` Oren Laadan 2010-11-21 22:41 ` Grant Likely 2010-11-22 17:34 ` Oren Laadan 2 siblings, 2 replies; 111+ messages in thread From: Gene Cooperman @ 2010-11-21 8:21 UTC (permalink / raw) To: Gene Cooperman Cc: Tejun Heo, Oren Laadan, Kapil Arya, linux-kernel, xemul, Eric W. Biederman, Linux Containers

As Kapil and I wrote before, we benefited greatly from having talked
with Oren, and learning some more about the context of the discussion.
We were able to understand better the good technical points that Oren
was making. Since the comparison table below concerns DMTCP, we'd like
to state some additional technical points that could affect the
conclusions.

> category        linux-cr                        userspace
> --------------------------------------------------------------------------------
> PERFORMANCE     has _zero_ runtime overhead     visible overhead due to syscalls
>                                                 interposition and state tracking
>                                                 even w/o checkpoints;

In our experiments so far, the overhead of system calls has been
unmeasurable. We never wrap read() or write(), in order to keep
overhead low. We also never wrap pthread synchronization primitives
such as locks, for the same reason. The other system calls are used
much less often, and so the overhead has been too small to measure in
our experiments.

> OPTIMIZATIONS   many optimizations possible     limited, less effective
>                 only in kernel, for downtime,   w/ much larger overhead.
>                 image size, live-migration

As above, we believe that the overhead while running is negligible.
I'm assuming that image size refers to in-kernel advantages for
incremental checkpointing. This is useful for apps where the modified
pages tend not to dominate. We agree with this point. As an
orthogonal point, by default DMTCP compresses all checkpoint images
using gzip on the fly. This is useful even when most pages are
modified between checkpoints. Still, as Oren writes, Linux C/R could
also add a userland component to compress checkpoint images on the
fly.

Next, live migration is a question that we simply haven't thought much
about. If it's important, we could think about what userland
approaches might exist, but we have no near-term plans to tackle live
migration.

> OPERATION       applications run unmodified     to do c/r, needs 'controller'
>                                                 task (launch and manage _entire_
>                                                 execution) - point of failure.
>                                                 restricts how a system is used.

We'd like to clarify what may be some misconceptions. The DMTCP
controller does not launch or manage any tasks. The DMTCP controller
is stateless, and is only there to provide a barrier, namespace
server, and single point of contact to relay ckpt/restart commands.
Recall that the DMTCP controller handles processes across hosts ---
not just on a single host.

Also, in any computation involving multiple processes, _every_ process
of the computation is a point of failure. If any process of the
computation dies, then the simple application strategy is to give up
and revert to an earlier checkpoint. There are techniques by which an
app or DMTCP can recreate certain failed processes. DMTCP doesn't
currently recreate a dead controller (no demand for it), but it's not
hard to do technically.

> PREEMPTIVE      checkpoint at any time, use     processes must be runnable and
>                 auxiliary task to save state;   "collaborate" for checkpoint;
>                 non-intrusive: failure does     long task coordination time
>                 not impact checkpointees.       with many tasks/threads. alters
>                                                 state of checkpointee if fails.
>                                                 e.g. cannot checkpoint when in
>                                                 vfork(), ptrace states, etc.

Our current support of vfork and ptrace has some of the issues that
Oren points out. One example occurs if a process is in the kernel,
and a ptrace state has changed. If it was important for some
application, we would either have to think of some "hack", or follow
Tejun's alternative suggestion to work with the developers to add
further kernel support. The kernel developers on this list can
estimate the difficulties of kernel support better than I can.

> COVERAGE        save/restore _all_ task state;  needs new ABI for everything:
>                 identify shared resources; can  expose state, provide means to
>                 extend for new kernel features  restore state (e.g. TCP protocol
>                 easily                          options negotiated with peers)

Currently, the only kernel support used by DMTCP is system calls
(wrappers), /proc/*/fd, /proc/*/maps, /proc/*/cmdline, /proc/*/exe,
/proc/*/stat. (I think I've named them all now.) The kernel
developers will know better than us what other kernel state one might
want to support for C/R, and what types of applications would need
that.

> RELIABILITY     checkpoint w/ single syscall;   non-atomic, cannot find leaks
>                 atomic operation. guaranteed    to determine restartability
>                 restartability for containers

My understanding is that the guarantees apply for Linux containers,
but not for a tree of processes. Does this imply that linux-cr would
have some of the same reliability issues as DMTCP for a tree of
processes? (I mean the question sincerely, and am not intending to be
rude.) In any case, won't DMTCP and Linux C/R have to handle
orthogonal reliability issues such as external database, time
virtualization, and other examples from our previous post?

> USERSPACE GLUE  possible                        possible
>
> SECURITY        root and non-root modes         root and non-root modes
>                 native support for LSM
>
> MAINTENANCE     changes mainly for features     changes mainly for features;
>                                                 create new ABI for features
>
> And by all means, I intend to cooperate with Gene to see how to
> make the other part of DMTCP, namely the userspace "glue", work on
> top of linux-cr to have the benefits of all worlds !

This is true, and we strongly welcome the cooperation.
We don't know how this experiment will turn out, but the only way to find out is to sincerely try it. Whether we succeed or fail, we will learn something either way! - Gene and Kapil ^ permalink raw reply [flat|nested] 111+ messages in thread
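As an illustration of the /proc/*/fd interface listed in the message above, the following sketch resolves what a given descriptor refers to by reading its symlink under /proc/<pid>/fd. The helper name fd_target() is hypothetical and this is not DMTCP's code. Pipes and sockets show up as pseudo-targets such as "pipe:[inode]", which is one way a userspace tool can tell that two descriptors share the same kernel object.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Resolve what descriptor `fd` of process `pid` refers to by reading
 * the /proc/<pid>/fd/<fd> symlink.  Returns the length of the target
 * string, or -1 on error.  On success buf is NUL-terminated; a target
 * longer than len - 1 bytes is silently truncated.
 */
ssize_t fd_target(pid_t pid, int fd, char *buf, size_t len)
{
    char path[64];
    ssize_t n;

    assert(buf != NULL && len > 1);
    snprintf(path, sizeof(path), "/proc/%ld/fd/%d", (long)pid, fd);

    n = readlink(path, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';   /* readlink() does not terminate the string */
    return n;
}
```

What this interface does not reveal (for example data still buffered inside a pipe, or negotiated socket options) is the kind of state the COVERAGE row of the table is about.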
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-21 8:21 ` Gene Cooperman @ 2010-11-22 18:02 ` Sukadev Bhattiprolu 2010-11-23 17:53 ` Oren Laadan 1 sibling, 0 replies; 111+ messages in thread From: Sukadev Bhattiprolu @ 2010-11-22 18:02 UTC (permalink / raw) To: Gene Cooperman Cc: Kapil Arya, linux-kernel, xemul, Linux Containers, Eric W. Biederman, Tejun Heo

Gene Cooperman [gene@ccs.neu.edu] wrote:
| > RELIABILITY     checkpoint w/ single syscall;   non-atomic, cannot find leaks
| >                 atomic operation. guaranteed    to determine restartability
| >                 restartability for containers
|
| My understanding is that the guarantees apply for Linux containers, but not
| for a tree of processes. Does this imply that linux-cr would have some
| of the same reliability issues as DMTCP for a tree of processes? (I mean
| the question sincerely, and am not intending to be rude.) In any case,
| won't DMTCP and Linux C/R have to handle orthogonal reliability issues
| such as external database, time virtualization, and other examples
| from our previous post?

Yes, if the user attempts to checkpoint a partial container (what we
refer to as a process subtree) or fails to snapshot/restore the
filesystem, there could be leaks that we cannot detect.

But one guarantee we are trying to provide is that if the user
checkpoints a _complete_ container, then we will detect a leak if one
exists.

Is there a way to establish a set of constraints (e.g. run the
application in a container, snapshot/restore the filesystem) and then
provide leak detection with a pure userspace implementation?

Sukadev

^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-21 8:21 ` Gene Cooperman 2010-11-22 18:02 ` Sukadev Bhattiprolu @ 2010-11-23 17:53 ` Oren Laadan 2010-11-24 3:50 ` Kapil Arya 1 sibling, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-23 17:53 UTC (permalink / raw) To: Gene Cooperman Cc: Tejun Heo, Kapil Arya, linux-kernel, xemul, Eric W. Biederman, Linux Containers On Sun, 21 Nov 2010, Gene Cooperman wrote: > As Kapil and I wrote before, we benefited greatly from having talked with Oren, > and learning some more about the context of the discussion. We were able > to understand better the good technical points that Oren was making. > Since the comparison table below concerns DMTCP, we'd like to > state some additional technical points that could affect the conlusions. > > > category linux-cr userspace > > -------------------------------------------------------------------------------- > > PERFORMANCE has _zero_ runtime overhead visible overhead due to syscalls > > interposition and state tracking > > even w/o checkpoints; > > In our experiments so far, the overhead of system calls has been > unmeasurable. We never wrap read() or write(), in order to keep overhead low. > We also never wrap pthread synchronization primitives such as locks, > for the same reason. The other system calls are used much less often, and so > the overhead has been too small to measure in our experiments. Syscall interception will have visible effect on applications that use those syscalls. You may not observe overheasd with HPC ones, but do you have numbers on server apps ? apps that use fork/clone and pipes extensively ? threads benchmarks et ? compare that to aboslute zero overhead of linux-cr. > > > OPTIMIZATIONS many optimizations possible limited, less effective > > only in kernel, for downtime, w/ much larger overhead. > > image size, live-migration > > As above, we believe that the overhead while running is negligible. 
For the HPC apps that you use.

> I'm assuming that image size refers to in-kernel advantages for incremental
> checkpointing. This is useful for apps where the modified pages tend
> not to dominate. We agree with this point. As an orthogonal point,
> by default DMTCP compresses all checkpoint images using gzip on the fly.
> This is useful even when most pages are modified between checkpoints.
> Still, as Oren writes, Linux C/R could also add a userland component
> to compress checkpoint images on the fly.

This is not a "userland component", it's "checkpoint | gzip > image.out"...

> Next, live migration is a question that we simply haven't thought much
> about. If it's important, we could think about what userland approaches might
> exist, but we have no near-term plans to tackle live migration.

As it is, live-migration _is_ a very important use case.

>
> > OPERATION    applications run unmodified     to do c/r, needs 'controller'
> >                                              task (launch and manage _entire_
> >                                              execution) - point of failure.
> >                                              restricts how a system is used.
>
> We'd like to clarify what may be some misconceptions. The DMTCP
> controller does not launch or manage any tasks. The DMTCP controller
> is stateless, and is only there to provide a barrier, namespace server,
> and single point of contact to relay ckpt/restart commands. Recall that
> the DMTCP controller handles processes across hosts --- not just on a
> single host.

The controller is another point of failure. I already pointed out that the
(controlled) application crashes when your controller dies, and you mentioned
it's a bug that should be fixed. But then there will always be a risk for
another, and another ... You also mentioned that if the controller dies,
then the app should continue to run, but will not be checkpointable anymore
(IIUC).

The point is that the controller is another point of failure, and makes the
execution/checkpoint intrusive. It also adds security and user-management
issues as you'll need one (or more ?) controller per user (right now, it's
one for all, no ?), and so on.

Plus, because the restarted apps get their virtualized IDs from the
controller, they can't now "see" existing/new processes that may get the
"same" pids (virtualization is not in the kernel).

> Also, in any computation involving multiple processes, _every_ process
> of the computation is a point of failure. If any process of the computation
> dies, then the simple application strategy is to give up and revert to an
> earlier checkpoint. There are techniques by which an app or DMTCP can
> recreate certain failed processes. DMTCP doesn't currently recreate
> a dead controller (no demand for it), but it's not hard to do technically.

The point is that you _add_ a point of failure: you make the "checkpoint"
operation a possible reason for the application to crash. In contrast, in
linux-cr the checkpoint is idempotent - harmless because it does not make
the applications execute. Instead, it merely observes their state.

> > PREEMPTIVE   checkpoint at any time, use     processes must be runnable and
> >              auxiliary task to save state;   "collaborate" for checkpoint;
> >              non-intrusive: failure does     long task coordination time
> >              not impact checkpointees.       with many tasks/threads. alters
> >                                              state of checkpointee if fails.
> >                                              e.g. cannot checkpoint when in
> >                                              vfork(), ptrace states, etc.
>
> Our current support of vfork and ptrace has some of the issues that Oren points
> out. One example occurs if a process is in the kernel, and a ptrace state has
> changed. If it was important for some application, we would either have
> to think of some "hack", or follow Tejun's alternative suggestion to work
> with the developers to add further kernel support. The kernel developers
> on this list can estimate the difficulties of kernel support better than I can.
>
> > COVERAGE     save/restore _all_ task state;   needs new ABI for everything:
> >              identify shared resources; can   expose state, provide means to
> >              extend for new kernel features   restore state (e.g. TCP protocol
> >              easily                           options negotiated with peers)
>
> Currently, the only kernel support used by DMTCP is system calls (wrappers),
> /proc/*/fd, /proc/*/maps, /proc/*/cmdline, /proc/*/exe, /proc/*/stat. (I think
> I've named them all now.) The kernel developers will know better
> than us what other kernel state one might want to support for C/R, and what
> types of applications would need that.
>
> > RELIABILITY  checkpoint w/ single syscall;    non-atomic, cannot find leaks
> >              atomic operation. guaranteed     to determine restartability
> >              restartability for containers
>
> My understanding is that the guarantees apply for Linux containers, but not
> for a tree of processes. Does this imply that linux-cr would have some
> of the same reliability issues as DMTCP for a tree of processes? (I mean
> the question sincerely, and am not intending to be rude.) In any case,
> won't DMTCP and Linux C/R have to handle orthogonal reliability issues
> such as external database, time virtualization, and other examples
> from our previous post?

There are two points in the claim above:

1) linux-cr can checkpoint with a single syscall - it's atomic. This gives
you more guarantees about the consistency of the checkpointed application(s),
and fewer "opportunities" for the operation as a whole to fail.

2) restartability - for full-container checkpoint only.

There is no "reliability" issue with c/r of non-containers - it's a matter of
definition: it depends on what your requirements are from the userspace
application and what sort of "glue" you have for it.

And I request again - let's leave out the questions of "time virtualization"
and "external databases" - how are they different for the VM virtualization
solution ? They are completely orthogonal to the question we are debating.

Thanks,

Oren.
>
> > USERSPACE GLUE  possible                      possible
> >
> > SECURITY        root and non-root modes       root and non-root modes
> >                 native support for LSM
> >
> > MAINTENANCE     changes mainly for features   changes mainly for features;
> >                                               create new ABI for features
>
> > And by all means, I intend to cooperate with Gene to see how to
> > make the other part of DMTCP, namely the userspace "glue", work on
> > top of linux-cr to have the benefits of all worlds !
>
> This is true, and we strongly welcome the cooperation. We don't know how
> this experiment will turn out, but the only way to find out is to sincerely
> try it. Whether we succeed or fail, we will learn something either way!
>
> - Gene and Kapil

^ permalink raw reply [flat|nested] 111+ messages in thread
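The "checkpoint | gzip > image.out" remark in the message above amounts to compressing the image as a stream while it is being written. A minimal sketch of that idea, assuming only that the checkpoint image arrives as a byte stream (the function name `compress_stream` is hypothetical, and nothing here is taken from either linux-cr or DMTCP):

```python
import gzip
import shutil
from typing import BinaryIO


def compress_stream(src: BinaryIO, dst: BinaryIO, chunk_size: int = 64 * 1024) -> None:
    """Copy src to dst through gzip without holding the whole image in memory."""
    # GzipFile does not close dst when it is passed as fileobj, so the caller
    # stays in charge of the output stream (a pipe, a file, a socket file).
    with gzip.GzipFile(fileobj=dst, mode="wb") as gz:
        shutil.copyfileobj(src, gz, chunk_size)
```

Used as a filter, one would pass `sys.stdin.buffer` and `sys.stdout.buffer`; the point made in the thread is that this step lives outside the checkpoint mechanism either way.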
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-23 17:53 ` Oren Laadan @ 2010-11-24 3:50 ` Kapil Arya 2010-11-25 16:04 ` Oren Laadan 0 siblings, 1 reply; 111+ messages in thread From: Kapil Arya @ 2010-11-24 3:50 UTC (permalink / raw) To: Oren Laadan Cc: Gene Cooperman, Tejun Heo, linux-kernel, xemul, Eric W. Biederman, Linux Containers (Our first comment below actually replies to an earlier post by Oren. It seemed simpler to combine our comments.) > > d. screen and other full-screen text programs These are not the only > > examples of difficult interactions with the rest of the world. > > This actually never required a userspace "component" with Zap or linux-cr (to > the best of my knowledge). We would guess that Zap would not be able to support screen without a user space component. The bug occurs when screen is configured to have a status line at the bottom. We would be interested if you want to try it and let us know the results. ============================================= > > > category linux-cr > > > userspace > > > -------------------------------------------------------------------------------- > > > PERFORMANCE has _zero_ runtime overhead visible overhead due to > > > syscalls interposition and state tracking even w/o checkpoints; > > > > In our experiments so far, the overhead of system calls has been > > unmeasurable. We never wrap read() or write(), in order to keep overhead > > low. We also never wrap pthread synchronization primitives such as locks, > > for the same reason. The other system calls are used much less often, and > > so the overhead has been too small to measure in our experiments. > > Syscall interception will have visible effect on applications that use those > syscalls. You may not observe overheasd with HPC ones, but do you have > numbers on server apps ? apps that use fork/clone and pipes extensively ? > threads benchmarks et ? compare that to aboslute zero overhead of linux-cr. 
Its true that we haven't taken serious data on overhead with server apps. Is there a particular server app that you are thinking of as an example? I would expect fork/clone and pipes to be invoked infrequently in the server apps and do not add measurably to CPU time. In most server apps such as MySQL, it is common to maintain a pool of threads for reuse rather than to repeatedly call clone for a new thread. This is done to ensure that the overhead of the clone calls is not significant. I would expect a similar policy for fork and pipes. <snip> > > > OPERATION applications run unmodified to do c/r, needs > > > 'controller' task (launch and manage _entire_ execution) - point of > > > failure. restricts how a system is used. > > > > We'd like to clarify what may be some misconceptions. The DMTCP controller > > does not launch or manage any tasks. The DMTCP controller is stateless, > > and is only there to provide a barrier, namespace server, and single point > > of contact to relay ckpt/restart commands. Recall that the DMTCP > > controller handls processes across hosts --- not just on a single host. > > The controller is another point of failure. I already pointed that the > (controlled) application crashes when your controller dies, and you mentioned > it's a bug that should be fixed. But then there will always be a risk for > another, and another ... You also mentioned that if the controller dies, > then the app should contionue to run, but will not be checkpointable anymore > (IIUC). > > The point is, that the controller is another point of failure, and makes the > execution/checkpoint intrusive. It also adds security and user-management > issues as you'll need one (or more ?) controller per user (right now, it's > one for all, no ?). and so on. Just to clarify, DMTCP uses one coordinator for each checkpointable computation. A single user may be running multiple computations with one coordinator for each computation. 
We don't actually use the word controller in DMTCP terminology because the coordinator is stateless and so in coordinating but not controlling other processes. > Plus, because the restarted apps get their virtualized IDs from the > controller, then they can't now "see" existing/new processes that may get the > "same" pids (virtualization is not in the kernel). This appears to be a misconception. The wrappers within the user process maintain the pid-translation table for that process. The translation table is the translation between the original pid given by the kernel and the current pid set by the kernel on restart. This is handled locally and does not involve the coordinator. In the case of a fork there could be a pid-clash (the original pid generated for a new process that conflicts with someone else's original pid). However, DMTCP handles this by checking within the fork wrapper for a pid-clash. In the rare case of a pid-clash, the child process exits and the parent forks again. Same applies for clone and any pid clash at restart time. > > Also, in any computation involving multiple processes, _every_ process > > of the computation is a point of failure. If any process of the > > computation dies, then the simple application strategy is to give up > > and revert to an earlier checkpoint. There are techniques by which an > > app or DMTCP can recreate certain failed processes. DMTCP doesn't > > currently recreate a dead controller (no demand for it), but it's not > > hard to do technically. > > The point is that you _add_ a point of failure: you make the "checkpoint" > operation a possible reason for the application to crash. In contrast, in > linux-cr the checkpoiint is idempotent - nunharmful because it does not make > the applications execute. Instead, it merely observes their state. We were speaking above of the case when the process dies during a computation. We were not referring to checkpoint time. <snip> We would like to add our own comment/question. 
To set the context we quote an earlier post: OL> Even if it did - the question is not how to deal with "glue" OL> (you demonstrated quite well how to do that with DMTCP), but OL> how should teh basic, core c/r functionality work - which is OL> below, and orthogonal to the "glue". There seems to be an implicit assumption that it is easy to separate the DMTCP "glue code" from the DMTCP C/R engine as separate modules. DMTCP is modular but it splits the problems into modules along a different line than Linux C/R. We look forward to the joint experiment in which we would try to combine DMTCP with Linux C/R. This will help answer the question in our mind. In order to explore the issue, let's imagine that we have a successful merge of DMTCP and Linux C/R. The following are some user-space glue issues. It's not obvious to us how the merged software will handle these issues. 1. Sockets -- DMTCP handles all sockets in a common manner through a single module. Sockets are checkpointed independently of whether they are local or remote. In a merger of DMTCP and Linux C/R, what does Linux C/R do when it sees remote sockets? Or should DMTCP take down all remote sockets before checkpointing? If DMTCP has to do this, it would be less efficient than the current design which keeps the remote sockets connections alive during checkpoint. 2. XLib and X11-server -- Consider checkpointing a single X11 app without the X11-server and without VNC. This is something we intend to add to DMTCP in the next few months. We have already mapped out the design in our minds. An X11 application includes the Xlib library. The data of an X11 window is, by default, contained in the X11 library -- not in the X11-server. The application communicates with the X11-server using socket connections, which would be considered a leak by Linux C/R. 
At restart time, DMTCP will ask the X11-server to create a bare window and then make the appropriate Xlib call to repaint the window based on the data stored in the Xlib library. For checkpoint/resume, the window stays up and does not has to be repainted. How will the combined DMTCP/Linux C/R work? Will DMTCP have to take down the window prior to Linux C/R and paint a new window at resume time? Doesn't this add inefficiency? 3. Checkpointing a single process (e.g. a bash shell) talking to an xterm via a pty -- We assume that from the viewpoint of Linux C/R a pty is a leak since there is a second process operating the master end of the pty. In this case we are guessing that Linux C/R would checkpoint and restart without the gurantees of reliability. We are guessing that Linux C/R would not save and restore the pty, instead it would be the responsibility of DMTCP to restore the current settings of the pty (e.g. packet mode vs. regular mode). Is our understanding correct? Would this work? Thanks, Gene and Kapil ^ permalink raw reply [flat|nested] 111+ messages in thread
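The pid handling described in the message above (a per-process translation table, plus a fork wrapper that retries on a pid clash) can be sketched as follows. This is only a model of the description, not DMTCP's implementation: `PidTable`, `fork_without_clash`, and the injected `spawn`/`discard` callables are hypothetical stand-ins for the real fork() wrapper and for letting the clashing child exit.

```python
from typing import Callable, Dict


class PidTable:
    """Per-process map between original (checkpoint-time) and current pids."""

    def __init__(self) -> None:
        self._orig_to_curr: Dict[int, int] = {}

    def bind(self, orig_pid: int, curr_pid: int) -> None:
        # On restart the kernel hands out a new pid; remember the pairing.
        self._orig_to_curr[orig_pid] = curr_pid

    def to_current(self, orig_pid: int) -> int:
        return self._orig_to_curr[orig_pid]

    def to_original(self, curr_pid: int) -> int:
        for orig, curr in self._orig_to_curr.items():
            if curr == curr_pid:
                return orig
        raise KeyError(curr_pid)

    def clashes(self, new_pid: int) -> bool:
        # A new child clashes if its pid is already someone's original pid.
        return new_pid in self._orig_to_curr


def fork_without_clash(table: PidTable,
                       spawn: Callable[[], int],
                       discard: Callable[[int], None],
                       max_tries: int = 16) -> int:
    """Fork until the child's pid does not collide with a known original pid."""
    for _ in range(max_tries):
        pid = spawn()
        if not table.clashes(pid):
            table.bind(pid, pid)  # fresh process: original == current
            return pid
        discard(pid)              # the clashing child exits, parent retries
    raise RuntimeError("could not obtain a clash-free pid")
```

Because the table is local to each process, no lookup goes through the coordinator, which is the point the message makes.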
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-24 3:50 ` Kapil Arya @ 2010-11-25 16:04 ` Oren Laadan 2010-11-29 4:09 ` Gene Cooperman 0 siblings, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-25 16:04 UTC (permalink / raw) To: Kapil Arya Cc: Gene Cooperman, Tejun Heo, linux-kernel, xemul, Eric W. Biederman, Linux Containers [-- Attachment #1: Type: TEXT/PLAIN, Size: 4441 bytes --] On Tue, 23 Nov 2010, Kapil Arya wrote: > OL> Even if it did - the question is not how to deal with "glue" > OL> (you demonstrated quite well how to do that with DMTCP), but > OL> how should teh basic, core c/r functionality work - which is > OL> below, and orthogonal to the "glue". > > There seems to be an implicit assumption that it is easy to separate the DMTCP > "glue code" from the DMTCP C/R engine as separate modules. DMTCP is modular but > it splits the problems into modules along a different line than Linux C/R. We > look forward to the joint experiment in which we would try to combine DMTCP > with Linux C/R. This will help answer the question in our mind. I apologize for being blunt - but this is probably an issue specific to DMTCP's engineering... > In order to explore the issue, let's imagine that we have a successful merge of > DMTCP and Linux C/R. The following are some user-space glue issues. It's not > obvious to us how the merged software will handle these issues. > > 1. Sockets -- DMTCP handles all sockets in a common manner through a single > module. Sockets are checkpointed independently of whether they are local or > remote. In a merger of DMTCP and Linux C/R, what does Linux C/R do when it sees > remote sockets? Or should DMTCP take down all remote sockets before > checkpointing? If DMTCP has to do this, it would be less efficient than the > current design which keeps the remote sockets connections alive during > checkpoint. What is a "local" socket ? af_unix, or locally connected af_inet ? 
Anyway, with linux-cr you'd do what's needed after the restarted tasks are
created, but before their state is restored. For each such "old" socket that
you want to replace, you'd create (in userspace, with arbitrary "glue" code!)
a new socket, and use this socket when restoring the state of the task.
Similarly, you could replace any other resource, not only sockets.

>
> 2. XLib and X11-server -- Consider checkpointing a single X11 app without the
> X11-server and without VNC. This is something we intend to add to DMTCP in the
> next few months. We have already mapped out the design in our minds. An X11
> application includes the Xlib library. The data of an X11 window is, by
> default, contained in the X11 library -- not in the X11-server. The application
> communicates with the X11-server using socket connections, which would be
> considered a leak by Linux C/R. At restart time, DMTCP will ask the
> X11-server to create a bare window and then make the appropriate Xlib call to
> repaint the window based on the data stored in the Xlib library.
> For checkpoint/resume, the window stays up and does not have to be repainted.
> How will the combined DMTCP/Linux C/R work? Will DMTCP have to take
> down the window prior to Linux C/R and paint a new window at resume time?
> Doesn't this add inefficiency?

Repainting during restart is the least of your problems.

Leak detection is not a problem: if the socket connects out of the container
(like af_inet), then it is not a leak, and you treat it as described above.
If the socket connects within the container but you don't checkpoint the
"peer" process, then it is not a container-c/r (in which case you don't look
for leaks).

Also, the application could mark resources to not be checkpointed (e.g.
scratch memory to save storage, or sockets to not count as leaks).

I don't see any problem with X11 or any other library and "glue".

>
> 3. Checkpointing a single process (e.g. a bash shell) talking to an xterm via
> a pty -- We assume that from the viewpoint of Linux C/R a pty is a leak since
> there is a second process operating the master end of the pty. In this case
> we are guessing that Linux C/R would checkpoint and restart without the
> guarantees of reliability. We are guessing that Linux C/R would not save and
> restore the pty, instead it would be the responsibility of DMTCP to restore
> the current settings of the pty (e.g. packet mode vs. regular mode). Is our
> understanding correct? Would this work?

I explain again - in case it wasn't clear from my 3-part post: leak detection
is relevant _only_ for full container-c/r. It doesn't make sense otherwise.

If you want to checkpoint individual components of an application, then it's
up to userspace to produce/provide the relevant "glue" to make it "make
sense" when those components restart without their original eco-system.

Thanks,

Oren.

^ permalink raw reply [flat|nested] 111+ messages in thread
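The socket-replacement step described in the message above (after the restarted tasks exist, before their state is restored) can be modeled roughly as below. This is a sketch only: the record layout, `plan_replacements`, and `socket_for_restore` are hypothetical and do not correspond to any real linux-cr interface.

```python
from typing import Any, Callable, Dict, List, Optional

SavedSocket = Dict[str, Any]  # e.g. {"fd": 4, "family": "af_inet", "peer": "db:5432"}


def plan_replacements(saved_sockets: List[SavedSocket],
                      wants_replacement: Callable[[SavedSocket], bool],
                      make_socket: Callable[[SavedSocket], Any]) -> Dict[int, Any]:
    """Build {old_fd: new_socket} for each saved socket the glue wants to replace."""
    return {s["fd"]: make_socket(s)
            for s in saved_sockets
            if wants_replacement(s)}


def socket_for_restore(saved: SavedSocket,
                       replacements: Dict[int, Any]) -> Optional[Any]:
    """Return the glue-provided socket, or None to restore from the image."""
    return replacements.get(saved["fd"])
```

Here `make_socket` is where arbitrary userspace glue would reconnect to an external peer; sockets the glue does not claim fall through to whatever the checkpoint image recorded.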
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-25 16:04 ` Oren Laadan @ 2010-11-29 4:09 ` Gene Cooperman 0 siblings, 0 replies; 111+ messages in thread From: Gene Cooperman @ 2010-11-29 4:09 UTC (permalink / raw) To: Oren Laadan Cc: Kapil Arya, Gene Cooperman, Tejun Heo, linux-kernel, xemul, Eric W. Biederman, Linux Containers Hi Oren, On Thu, Nov 25, 2010 at 11:04:16AM -0500, Oren Laadan wrote: > On Tue, 23 Nov 2010, Kapil Arya wrote: > > > OL> Even if it did - the question is not how to deal with "glue" > > OL> (you demonstrated quite well how to do that with DMTCP), but > > OL> how should teh basic, core c/r functionality work - which is > > OL> below, and orthogonal to the "glue". > > > > There seems to be an implicit assumption that it is easy to separate the DMTCP > > "glue code" from the DMTCP C/R engine as separate modules. DMTCP is modular but > > it splits the problems into modules along a different line than Linux C/R. We > > look forward to the joint experiment in which we would try to combine DMTCP > > with Linux C/R. This will help answer the question in our mind. > > I apologize for being blunt - but this is probably an issue specific to > DMTCP's engineering... > I completely agree with you, Oren. DMTCP was never designed to be split into a userland and in-kernel replacement. We will want to re-factor DMTCP to make this happen. I'm sorry if my e-mail came off as confrontational. That was not my intention. I was just looking forward to an interesting intellectual experiment --- how to go about combining DMTCP and Linux C/R. I was trying to guess ahead of time where there are interesting challenges, and my hope is that we will find a way to solve them together. Best wishes, - Gene ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-21 8:18 ` Gene Cooperman 2010-11-21 8:21 ` Gene Cooperman @ 2010-11-21 22:41 ` Grant Likely 2010-11-22 17:34 ` Oren Laadan 2 siblings, 0 replies; 111+ messages in thread From: Grant Likely @ 2010-11-21 22:41 UTC (permalink / raw) To: Gene Cooperman Cc: Tejun Heo, Oren Laadan, Kapil Arya, linux-kernel, xemul, Eric W. Biederman, Linux Containers On Sun, Nov 21, 2010 at 03:18:53AM -0500, Gene Cooperman wrote: > In this post, Kapil and I will provide our own summary of how we > see the issues for discussion so far. In the next post, we'll reply > specifically to comment on Oren's table of comparison between > linux-cr and userspace. > > In general, we'd like to add that the conversation with Oren was very > useful for us, and I think Oren will also agree that we were able to > converge on the purely technical questions. Hi Gene, Thanks for the good summary, it helps. Some random comments below... > > Concerning opinions, we want to be cautious on opinions, since we're > still learning the context of this ongoing discussion on LKML. There is > probably still some context that we're missing. > > Below, we'll summarize the four major questions that we've understood from > this discussion so far. But before doing so, I want to point out that a single > process or process tree will always have many possible interactions with > the rest of the world. Within our own group, we have an internal slogan: > "You can't checkpoint the world." > A virtual machine can have a relatively closed world, which makes it more > robust, but checkpointing will always have some fragile parts. > We give four examples below: > a. time virtualization > b. external database > c. NSCD daemon > d. screen and other full-screen text programs > These are not the only examples of difficult interactions with the > rest of the world. > > Anyway, in my opinion, the conversation with Oren seemed to converge > into two larger cases: > 1. 
In a pure userland C/R like DMTCP, how many corner cases are not handled, > or could not be handled, in a pure userland approach? > Also, how important are those corner cases? Do some > have important use cases that rise above just a corner case? > [ inotify is one of those examples. For DMTCP to support this, > it would have to put wrappers around inotify_add_watch, > inotify_rm_watch, read, etc., and maybe even tracking inodes in case > the file had been renamed after the inotify_add_watch. Something > could be made to work for the common cases, but it would > still be a hack --- to be done only if a use case demands it. ] > 2. In a Linux C/R approach, it's already recognized that one needs > a userland component (for example, for convenience of recreating > the process tree on restart). How many other cases are there > that require a userland component? > [ One example here is the shared memory segment of NSCD, which > has to be re-initialized on restart. Another example is > a screen process that talks to an ANSI terminal emulator > (e.g. gnome-terminal), which talks to an X server or VNC server. > Below, we discuss these examples in more detail. ] > > One can add a third and fourth question here: > > 3. [Originally posed by Oren] Given Linux C/R, how much work would > it be to add the higher layers of DMTCP on top of Linux C/R? > [ This is a non-trivial question. As just one example, DMTCP > handles sockets uniformly, regardless of whether they > are intra-host or inter-host. Linux C/R handles certain > types of intra-host sockets. So, merging the two would > require some thought. ] > 4. [Originally posed by Tejun, e.g. Fri Nov 19 2010 - 09:04:42 EST] > Given that DMTCP checkpoints many common applications, how much work > would it be to add a small number of restricted kernel interfaces > to enable one to remove some of the hacks in DMTCP, and to cover > the more important corner cases that DMTCP might be missing? 
> > > I'd also like to add some points of my own here. First, there are certain > cases where I believe that a checkpoint-restart system (in-kernel > or userland or hybrid) can never be completely transparent. It's because you > can't completely cut the connection with the rest of the world. In these > examples, I'm thinking primarily of the Linux C/R mode used to checkpoint > a tree of processes. > To the extent that Linux C/R is used with containers, it seems > to me to be closer to lightweight virtualization. From there, I've > seen that the conversation goes to comparing lightweight virtualization > versus traditional virtual machines, but that discussion goes beyond my > own personal expertise. At the risk of restating already applied arguments, and as a c/r outsider, this touches on the real crux of the issue for me. What is the complete set of boundaries between a c/r group of processes and the outside world? Is it bounded and is it understandable by mere kernel engineers? Does it change the assumptions about what a Linux process /is/, and how to handle it? How much? The broad strokes seem to be straight forward, but as already pointed out, the devil is in the details. > Here are some examples that I believe that every checkpointing system > would suffer from the syndrome of trying to "checkpoint the world". > > 1. Time virtualization --- Right now, neither system does time virtualization. > Both systems could do it. But what is the right policy? > For example, one process may set a deadline for a task an hour > in the future, and then periodically poll the kernel for the current time > to see if one hour has passed. This use case seems to require time > virtualization. > A second process wants to know the current day and time, because a certain > web service updates its information at midnight each day. This use case seems > seems to argue that time virtualization is bad. Temporal issues need to be (are being?) addressed regardless. 
In certain respects, I'm sure c/r can be seen as a *really long* scheduler latency, and would have the same effect as a system going into suspend, or a vm-level checkpoint. I would think the same behaviour would be desirable in all cases, include c/r. > 2. External database file on another host --- It's not possible to > checkpoint the remote database file. In our work with the Condor developers, > they asked us to add a "Condor mode", which says that if there are any > external socket connections, then delay the checkpoint until the external > socket connections are closed. In a different joint project with CERN (Geneva), > we considered a checkpointing application in which an application > saves much of the database, and then on restart, discovers how much > of its data is stale, and re-loads only the stale portion. > > 3. NSCD (Network Services Caching Daemon) --- Glibc arranges for > certain information to be cached in the NSCD. The information is > in a memory segment shared between the NSCD and the application. > Upon restart, the application doesn't know that the memory segment > is no longer shared with the NSCD, or that the information is stale. > The DMTCP "hack" is to zero out this memory page on restart. Then glibc > recognizes that it needs to create a new shared memory segment. Right here is exactly the example of a boundary that needs explicit rules. When a pair of processes have a shared region, and only one of them is checkpointed, then what is the behaviour on restore? In this specific example, a context-specific hack is used to achieve the desired result, but that doesn't work (as I believe you agree) in the general case. What behaviour will in-kernel support need to enforce? > 3. screen --- The screen application sets the scrolling region of > its ANSI terminal emulator, in order to create a status line > at the bottom, while scrolling the remaining lines of the terminal. 
> Upon restart, screen assumes that the scrolling region > has already been set up, and doesn't have to be re-initialized. > So, on restart, DMTCP uses SIGWINCH to fool screen (or any > full-screen text-based application) into believing that its > window size has been changed. So, screen (or vim, or emacs) > then re-initializes the state of its ANSI terminal, including > scrolling regions and so on. > So, a userland component is helpful in doing the kind of hacks above. > I recognize that the Linux C/R team agrees that some userland component > can be useful. I just want to show why some userland hacks will always be > needed. Let's consider a pure in-kernel approach to checkpointing 'screen' > (or almost any full-screen application that uses a status bar at the bottom). > Screen sets the scrolling region of an ANSI terminal emulator, > which might be a gnome-terminal. So, a pure in-kernel approach > needs to also checkpoint the gnome-terminal. But the gnome-terminal > needs to talk to an X server. So, now one also needs to start > up inside a VNC server to emulate the X server. So, either > one adds a "hack" in userland to force screen to re-initialize > its ANSI terminal emulator, or else one is forced to include > an entire VNC server just to checkpoint a screen process. ] > > Finally, this excerpt below from Tejun's post sums up our views too. We don't > have the kernel expertise of the people on this list, but we've had > to do a little bit of reading the kernel code where the documentation > was sparse and in teaching O/S. We would certainly be very happy to work > closely with the kernel developers, if there was interest in extending > DMTCP to directly use more kernel support. > > - Gene and Kapil > > Tejun Heo wrote Fri Nov 19 2010 - 09:04:42 EST > > What's so wrong with Gene's work? Sure, it has some hacky aspects but > > let's fix those up. To me, it sure looks like much saner and > > manageable approach than in-kernel CR. 
We can add nested ptrace, > > CLONE_SET_PID (or whatever) in pidns, integrate it with various ns > > supports, add an ability to adjust brk, export inotify state via > > fdinfo and so on. > > > > The thing is already working, the codebase of core part is fairly > > small and condor is contemplating integrating it, so at least some > > people in HPC segment think it's already viable. Maybe the HPC > > cluster I'm currently sitting near is special case but people here > > really don't run very fancy stuff. In most cases, they're fairly > > simple (from system POV) C programs reading/writing data and burning a > > _LOT_ of CPU cycles inbetween and admins here seem to think dmtcp > > integrated with condor would work well enough for them. > > > > Sure, in-kernel CR has better or more reliable coverage now but by how > > much? The basic things are already there in userland. > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 111+ messages in thread
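For reference, the SIGWINCH trick described in the quoted message above (nudging screen, vim or emacs into re-initializing its terminal after restart) can be sketched in C roughly as follows. This is only a minimal illustration of the signal mechanics, not DMTCP's code: the helper names (install_winch_handler, nudge_terminal_app, winch_event_count, winch_selftest) are hypothetical, and a real full-screen program would re-query the window size and redraw where this sketch only counts events.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Number of SIGWINCH deliveries seen so far; sig_atomic_t is safe to
 * update from inside a signal handler. */
static volatile sig_atomic_t winch_events = 0;

/* Application side: a real program would re-read the window size and
 * re-initialize its terminal state (scrolling region etc.) here. */
static void on_winch(int signo)
{
    (void)signo;
    winch_events++;
}

/* Application side: register for window-size-change notifications. */
int install_winch_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_winch;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGWINCH, &sa, NULL);
}

/* Restart side (hypothetical helper): pretend the window was resized. */
int nudge_terminal_app(pid_t pid)
{
    return kill(pid, SIGWINCH);
}

int winch_event_count(void)
{
    return (int)winch_events;
}

/* Signal ourselves and check that the handler ran exactly once more. */
void winch_selftest(void)
{
    int before;

    assert(install_winch_handler() == 0);
    before = winch_event_count();
    assert(nudge_terminal_app(getpid()) == 0);
    assert(winch_event_count() == before + 1);
}
```

The point of the sketch is that the restart tool never has to know what the terminal state was; it only has to trigger the code path the application already has for a resize.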
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-21 8:18 ` Gene Cooperman 2010-11-21 8:21 ` Gene Cooperman 2010-11-21 22:41 ` Grant Likely @ 2010-11-22 17:34 ` Oren Laadan 2 siblings, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-22 17:34 UTC (permalink / raw) To: Gene Cooperman Cc: Tejun Heo, Kapil Arya, linux-kernel, xemul, Eric W. Biederman, Linux Containers On Sun, 21 Nov 2010, Gene Cooperman wrote: > Below, we'll summarize the four major questions that we've understood from > this discussion so far. But before doing so, I want to point out that a single > process or process tree will always have many possible interactions with > the rest of the world. Within our own group, we have an internal slogan: > "You can't checkpoint the world." > A virtual machine can have a relatively closed world, which makes it more > robust, but checkpointing will always have some fragile parts. That depends on what your definition of "world" is. One definition is "world := VM", as you state above. Another is "world := container" which I stated in my post(s). You can checkpoint both. For those cases where the "world" cannot be fully checkpointed, I explicitly pointed out that we should focus on the core c/r functionality, because the "glue" can be done either way. > We give four examples below: > a. time virtualization IMHO, irrelevant to current discussion. And btw, this is done in linux-cr for live migration of tcp connections. > b. external database > c. NSCD daemon This falls within the category of "glue", and is - as I try once again to remind - entirely orthogonal to the topic of where to do c/r. > d. screen and other full-screen text programs > These are not the only examples of difficult interactions with the > rest of the world. This actually never required a userspace "component" with Zap or linux-cr (to the best of my knowledge). 
Even if it did - the question is not how to deal with "glue" (you demonstrated quite well how to do that with DMTCP), but how the basic, core c/r functionality should work - which is below, and orthogonal to the "glue". Let us please focus on the base c/r engine functionality... (gotta disconnect now .. more later) Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-20 19:33 ` Tejun Heo 2010-11-21 8:18 ` Gene Cooperman @ 2010-11-22 17:18 ` Oren Laadan 1 sibling, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-22 17:18 UTC (permalink / raw) To: Tejun Heo Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul, Eric W. Biederman, Linux Containers On Sat, 20 Nov 2010, Tejun Heo wrote: > Hello, > > On 11/20/2010 07:15 PM, Oren Laadan wrote: > > > > [[apologies for the silly prefix on last two posts - a combination > > of windows, putty, pine and slow connection is not helping me :( ]] > > Maybe it's a good idea to post a clean concatenated version for later > reference? > Sure, as soon as I am back on a sane connection (~1 week) (I cut it in three to make it easier for people to digest ...) Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 11:57 ` Tejun Heo 2010-11-17 15:39 ` Serge E. Hallyn @ 2010-11-17 22:17 ` Matt Helsley 2010-11-18 10:06 ` Tejun Heo 2010-11-18 20:25 ` Oren Laadan 1 sibling, 2 replies; 111+ messages in thread From: Matt Helsley @ 2010-11-17 22:17 UTC (permalink / raw) To: Tejun Heo Cc: Oren Laadan, Gene Cooperman, Matt Helsley, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch, Linux Containers On Wed, Nov 17, 2010 at 12:57:40PM +0100, Tejun Heo wrote: > Hello, Oren. > > On 11/07/2010 10:59 PM, Oren Laadan wrote: <snip> > > > Even if that happens (which is very unlikely and unnecessary), > > it will generate all the very same code in the kernel that Tejun > > has been complaining about, and _more_. And we will still suffer > > from issues such as lack of atomicity and being unable to do many > > simple and advanced optimizations. > > It may be harder but those will be localized for specific features > which would be useful for other purposes too. With in-kernel CR, > you're adding a bunch of intrusive changes which can't be tested or > used apart from CR. You seem to be arguing "Z is only testable/useful for doing the things Z was made for". I couldn't agree more with that. CR is useful for: Fault-tolerance (typical HPC) Load-balancing (less-typical HPC) Debugging (simple [e.g. instead of coredumps] or complex time-reversible) Embedded devices that need to deal with persistent low-memory situations. I think Oren's Kernel Summit presentation succinctly summarized these: http://www.cs.columbia.edu/~orenl/talks/ksummit-2010.pdf My personal favorite idea (that hasn't been implemented yet) is an application startup cache. I've been wondering if caching bash startup after all the shared libraries have been searched, loaded, and linked couldn't save a bunch of time spent in shell scripts. Post-link actually seems like a checkpoint in application startup which would be generally useful too. 
Of course you'd want to flush [portions of] the cache when packages get upgraded/removed or shell PATHs change and the caches would have to be per-user. I'm less confident but still curious about caching after running rc scripts (less confident because it would depend highly on the content of the rc scripts). A scripted boot, for example, might be able to save some time if the same rc scripts are run and they don't vary over time. That in turn might be useful for carefully-tuned boots on embedded devices. That said we don't currently have code for application caching. Yet we can't be expected to write tools for every possible use of our API in order to show just how true your tautology is. > > > Or we could use linux-cr for that: do the c/r in the kernel, > > keep the know-how in the kernel, expose (and commit to) a > > per-kernel-version ABI (not vow to keep countless new individual Oren, that statement might be read to imply that it's based on something as useless as kernel version numbers. Arnd has pointed out in the past how unsuitable that is and I tend to agree. There are at least two possible things we can relate it to: the SHA of the compiled kernel tree (which doesn't quite work because it assumes everybody uses git trees :( ), or perhaps the SHA/hash of the cpp-processed checkpoint_hdr.h. We could also stuff that header into the kernel (much like kconfigs are output from /proc) for programs that want the kernel to describe the ABI to them. > > ABIs forever after getting them wrongly...), be able to do all > > sorts of useful optimization and provide atomicity and guarantees > > (see under "leak detection" in the OLS linux-cr paper). Also, > > once the c/r infrastructure is in the kernel, it will be easy > > (and encouraged) to support newly introduced features. 
> > And the only reason it seems easier is because you're working around > the ABI problem by declaring that these binary blobs wouldn't be kept > compatible between different kernel versions and configurations. That Not true. First of all, if you look at checkpoint_hdr.h, the contents and layout of the structs don't vary according to kernel configurations. Secondly, we have taken measures to increase the likelihood that the structures will remain compatible. We've designed them to layout the same on 32-bit and 64-bit variants of an arch. We add to the end of the structs. We use an explicit length field in a header to each section to ensure that changes in the size of the structures don't necessarily break compatibility. That said, yes, these measures don't absolutely preclude incompatibility. They will however make compatibility more likely. Then there's the fact that structures like siginfo (for example) so rarely change because they're already part of an ABI. That in turn means that the corresponding checkpoint ABI rarely changes (we don't reuse the existing struct because that would require compat-syscall-style code). Most of the time, in fact, the fields we output are there only because they reflect the 'model' of how things work that the kernel presents to userspace. That model also rarely changes (we've never gotten rid of the POSIX concept of process groups in one extreme example). Perhaps the closest thing we have to wholly-kernel-internal data structures are the signal/sighand structs which echo the way these fields are split from the task struct and shared between tasks. Though I'd argue that gets back into the 'model' presented to userspace (via fork/clone) anyway... I'd estimate that the biggest 'model' changes have come via various filesystem interfaces over the years. We don't checkpoint tasks with open sysfs, /proc, or debugfs files (for example) so that's not part of our ABI and we don't intend to make it so. 
Nor do we output wholly kernel-internal structures and fields that are often chosen for their performance benefits (e.g. rbtrees, linked lists, hash tables, idrs, various caches, locks, RCU heads, refcounts, etc). So the kernel is free to change implementations without affecting our ABI. The compatibility has natural limits. For instance we can't ever restart an x86_64 binary on a 32-bit kernel. If you add a new syscall interface (e.g. fanotify) then you can't use a checkpoint of a task that makes use of it on fanotify-disabled kernels. Yet these limitations exist no matter where or how you choose to implement checkpoint/restart. We've made almost every effort at making this a proper ABI (I say 'almost' because we still need to export a description of it at runtime and we need to do something better in place of the logfd output). Still, the essentials of a proper checkpoint/restart ABI are already there. Cheers, -Matt Helsley ^ permalink raw reply [flat|nested] 111+ messages in thread
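As an aside, the compatibility measures described in the message above (fixed layout on 32-bit and 64-bit, new fields only appended at the end, and an explicit length in each section header) can be sketched in C along these lines. This is a minimal sketch under stated assumptions, not the contents of checkpoint_hdr.h: struct sec_hdr, struct task_v1, struct task_v2, read_section() and section_selftest() are hypothetical names invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical section header: fixed-width fields only, so the layout is
 * the same on 32-bit and 64-bit builds of an architecture. */
struct sec_hdr {
    uint32_t type;
    uint32_t len;   /* total section length in bytes, header included */
};
_Static_assert(sizeof(struct sec_hdr) == 8, "header layout must not vary");

/* "Version 1" of a hypothetical per-task section. */
struct task_v1 {
    struct sec_hdr h;
    uint32_t pid;
    uint32_t state;
};

/* "Version 2" only appends fields at the end. */
struct task_v2 {
    struct sec_hdr h;
    uint32_t pid;
    uint32_t state;
    uint32_t exit_code;
    uint32_t pad;
};

/* Read one section into `out` (capacity out_size). Copies only the bytes
 * both sides know about, zero-fills the rest, and returns the number of
 * bytes to skip to reach the next section, or 0 if the input is malformed. */
size_t read_section(const unsigned char *buf, size_t avail,
                    void *out, size_t out_size)
{
    struct sec_hdr h;

    if (avail < sizeof(h))
        return 0;
    memcpy(&h, buf, sizeof(h));
    if (h.len < sizeof(h) || h.len > avail)
        return 0;
    memset(out, 0, out_size);
    memcpy(out, buf, h.len < out_size ? h.len : out_size);
    return h.len;
}

/* Old reader / new image, and new reader / old image. */
void section_selftest(void)
{
    struct task_v2 img2 = { { 1, (uint32_t)sizeof(img2) }, 42, 7, 3, 0 };
    struct task_v1 img1 = { { 1, (uint32_t)sizeof(img1) }, 42, 7 };
    struct task_v1 old_reader;
    struct task_v2 new_reader;

    /* The old reader ignores the appended fields but still skips them. */
    assert(read_section((const unsigned char *)&img2, sizeof(img2),
                        &old_reader, sizeof(old_reader)) == sizeof(img2));
    assert(old_reader.pid == 42 && old_reader.state == 7);

    /* The new reader sees the missing appended fields as zero. */
    assert(read_section((const unsigned char *)&img1, sizeof(img1),
                        &new_reader, sizeof(new_reader)) == sizeof(img1));
    assert(new_reader.pid == 42 && new_reader.exit_code == 0);
}
```

As the message says, this kind of scheme makes compatibility more likely rather than guaranteed: it handles fields being appended, but not a field changing meaning.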
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 22:17 ` Matt Helsley @ 2010-11-18 10:06 ` Tejun Heo 2010-11-18 20:25 ` Oren Laadan 1 sibling, 0 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-18 10:06 UTC (permalink / raw) To: Matt Helsley Cc: Oren Laadan, Gene Cooperman, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch, Linux Containers Hello, Matt. On 11/17/2010 11:17 PM, Matt Helsley wrote: >> It may be harder but those will be localized for specific features >> which would be useful for other purposes too. With in-kernel CR, >> you're adding a bunch of intrusive changes which can't be tested or >> used apart from CR. > > You seem to be arguing "Z is only testable/useful for doing the things Z > was made for". I couldn't agree more with that. CR is useful for: I'm saying it's way too narrow scoped and inflexible to be a kernel feature. Kernel features should be like the basic tools, you know, hammers, saws, drills and stuff. In-kernel CR is more like an over complicated food processor which usually sits in the top drawer after first several runs, > Fault-tolerance (typical HPC) > Load-balancing (less-typical HPC) > Debugging (simple [e.g. instead of coredumps] or complex > time-reversible) > Embedded devices that need to deal with persistent low-memory > situations. which can do all of the above, a lot of which can be achieved in less messy way than putting the whole thing inside the kernel. > My personal favorite idea (that hasn't been implemented yet) is an > application startup cache. I've been wondering if caching bash startup > after all the shared libraries have been searched, loaded, and linked > couldn't save a bunch of time spent in shell scripts. Post-link actually > seems like a checkpoint in application startup which would be generally > useful too. Of course you'd want to flush [portions of] the cache when > packages get upgraded/removed or shell PATHs change and the caches > would have to be per-user. 
What does that have anything to do with the kernel? If you want post-link cache, implement it in ld.so where it belongs. That's like using food processor to mix cement. > I'm less confident but still curious about caching after running rc > scripts (less confident because it would depend highly on the content > of the rc scripts). A scripted boot, for example, might be able to save > some time if the same rc scripts are run and they don't vary over time. > That in turn might be useful for carefully-tuned boots on embedded devices. > > That said we don't currently have code for application caching. Yet we > can't be expected to write tools for every possible use of our API in > order to show just how true your tautology is. Continuing the same line of thought. It _CAN_ be used to do that in a convoluted way but there are better ways to solve those problems. > Most of the time, in fact, the fields we output are there only because > they reflect the 'model' of how things work that the kernel presents to > userspace. That model also rarely changes (we've never gotten rid of the > POSIX concept of process groups in one extreme example). Perhaps the > closest thing we have to wholly-kernel-internal data structures are the > signal/sighand structs which echo the way these fields are split from the > task struct and shared between tasks. Though I'd argue that gets back into > the 'model' presented to userspace (via fork/clone) anyway... Yeah, exactly, so just do it inside the established ABI extending where it makes sense. No reason to add a whole separate set. Thanks. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 22:17 ` Matt Helsley 2010-11-18 10:06 ` Tejun Heo @ 2010-11-18 20:25 ` Oren Laadan 1 sibling, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-18 20:25 UTC (permalink / raw) To: Matt Helsley Cc: Tejun Heo, Gene Cooperman, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch, Linux Containers On 11/17/2010 05:17 PM, Matt Helsley wrote: > On Wed, Nov 17, 2010 at 12:57:40PM +0100, Tejun Heo wrote: >> Hello, Oren. >> >> On 11/07/2010 10:59 PM, Oren Laadan wrote: <snip> >>> Or we could use linux-cr for that: do the c/r in the kernel, >>> keep the know-how in the kernel, expose (and commit to) a >>> per-kernel-version ABI (not vow to keep countless new individual > > Oren, that statement might be read to imply that it's based on > something as useless as kernel version numbers. Arnd has pointed out in the > past how unsuitable that is and I tend to agree. There are at least two > possible things we can relate it to: the SHA of the compiled kernel tree > (which doesn't quite work because it assumes everybody uses git trees :( ), > or perhaps the SHA/hash of the cpp-processed checkpoint_hdr.h. We could > also stuff that header into the kernel (much like kconfigs are output from > /proc) for programs that want the kernel to describe the ABI to them. BTW, it's the same for userspace c/r: for the same set of features, the format (ABI) remains unchanged. Adding features breaks this and a new version is necessary, and conversion from old to new will be needed. Moreover, supporting a new feature in userspace means adding the proper API/ABI in the kernel, including refactoring etc, which is even harder than adding the support for it in linux-cr. Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-06 20:40 ` Gene Cooperman 2010-11-06 22:41 ` Oren Laadan @ 2010-11-07 21:44 ` Oren Laadan 2010-11-07 23:31 ` Gene Cooperman 1 sibling, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-07 21:44 UTC (permalink / raw) To: Gene Cooperman Cc: Matt Helsley, Tejun Heo, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch, Linux Containers [cc'ing linux containers mailing list] On 11/06/2010 04:40 PM, Gene Cooperman wrote: > 8. What happens if the DMTCP coordinator ( checkpoint control process) dies? > [ The same thing that happens if a user process dies. We kill the whole > computation, and restart. At restart, we use a new coordinator. > Coordinators are stateless. ] My experience is different: I downloaded dmtcp and followed the quick-start guide: (1) "dmtcp_coordinator" on one terminal (2) "dmtcp_checkpoint bash" on another terminal Then I: (3) pkill -9 dmtcp_coordinator ... oops - 'bash' died. I didn't even try to take a checkpoint :( Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-07 21:44 ` Oren Laadan @ 2010-11-07 23:31 ` Gene Cooperman 0 siblings, 0 replies; 111+ messages in thread From: Gene Cooperman @ 2010-11-07 23:31 UTC (permalink / raw) To: Oren Laadan Cc: Gene Cooperman, Matt Helsley, Tejun Heo, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch, Linux Containers On Sun, Nov 07, 2010 at 04:44:20PM -0500, Oren Laadan wrote: > [cc'ing linux containers mailing list] > > On 11/06/2010 04:40 PM, Gene Cooperman wrote: > > >8. What happens if the DMTCP coordinator ( checkpoint control process) dies? > > [ The same thing that happens if a user process dies. We kill the whole > > computation, and restart. At restart, we use a new coordinator. > > Coordinators are stateless. ] > > My experience is different: > > I downloaded dmtcp and followed the quick-start guide: > (1) "dmtcp_coordinator" on one terminal > (2) "dmtcp_checkpoint bash" on another terminal > > Then I: > (3) pkill -9 dmtcp_coordinator > ... oops - 'bash' died. > > I didn't even try to take a checkpoint :( You're right. I just reproduced your example. But please remember that we're working in a design space where if any process of a computation dies, then we kill the computation and restart. It doesn't matter to us if it's a user process or the DMTCP coordinator that died. I do think this is getting too detailed for the LKML list, but since you bring it up, here is the analysis. The user bash process exits with: [31331] ERROR at dmtcpmessagetypes.cpp:62 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed' _magicBits = Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly? This means that when the DMTCP coordinator died, it sent a message to the checkpoint thread within the user process. The message was ill-formed. The current DMTCP code says that if a checkpoint thread receives an ill-formed message from the coordinator, then it should die. 
It's not hard to change the protocol between DMTCP coordinator and checkpoint thread of the user process into a more robust protocol with RETRY, further ACK, etc. We haven't done this. Right now, the user simply restarts from the last checkpoint. If one process of a computation has been compromised (either DMTCP coordinator or user process), then the whole computation has been compromised. I think in a previous version of DMTCP, the policy was to allow the computation to continue when the coordinator dies. Policies change. But I think you're missing the larger point. We've developed DMTCP over six years, largely with programmers who are much less experienced than the kernel developers. Yet DMTCP works reliably for many users. I consider this a credit to the DMTCP design. The Linux C/R design is also excellent. Can we get back to questions of design, using the implementations as reference implementations? If you don't object, I'll also skip replying to the other post, since I think we're getting too detailed. I'm having trouble keeping up with the posts. :-) An offline discussion will give us time to look more carefully at these issues, and draw more careful conclusions. Thanks, - Gene ^ permalink raw reply [flat|nested] 111+ messages in thread
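To make the failure mode in the message above concrete: the check that fired is a magic-string comparison on a message received from the coordinator. A rough C sketch of such a check follows. It is an assumption-laden illustration and not DMTCP's implementation: the struct layout, the EXAMPLE_MAGIC value and the names check_msg() and msg_selftest() are made up, and the real code uses a JASSERT on DMTCP_MAGIC_STRING as shown in the error output. The sketch reports a short read separately from a bad magic, so that a caller could in principle choose a policy other than dying.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_MAGIC "EXAMPLE_MAGIC"   /* made-up value for illustration */

/* Hypothetical wire format of a coordinator message. */
struct coord_msg {
    char     magic[16];
    uint32_t type;
    uint32_t payload_len;
};

enum msg_status { MSG_OK = 0, MSG_SHORT_READ, MSG_BAD_MAGIC };

/* Validate `nread` bytes received from the peer. */
enum msg_status check_msg(const void *buf, size_t nread)
{
    struct coord_msg m;

    if (nread < sizeof(m))
        return MSG_SHORT_READ;  /* e.g. the peer went away mid-message */
    memcpy(&m, buf, sizeof(m));
    /* Compare the magic including its terminating NUL byte. */
    if (memcmp(m.magic, EXAMPLE_MAGIC, sizeof(EXAMPLE_MAGIC)) != 0)
        return MSG_BAD_MAGIC;
    return MSG_OK;
}

/* Exercise the three outcomes. */
void msg_selftest(void)
{
    struct coord_msg m;

    memset(&m, 0, sizeof(m));
    strcpy(m.magic, EXAMPLE_MAGIC);
    assert(check_msg(&m, sizeof(m)) == MSG_OK);
    assert(check_msg(&m, 0) == MSG_SHORT_READ);
    m.magic[0] = 'x';
    assert(check_msg(&m, sizeof(m)) == MSG_BAD_MAGIC);
}
```

Whether the receiver should retry, continue without a coordinator, or abort the whole computation on anything other than MSG_OK is exactly the policy question discussed in the message.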
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-04 8:05 ` Tejun Heo 2010-11-04 16:44 ` Gene Cooperman @ 2010-11-05 22:24 ` Oren Laadan 1 sibling, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-05 22:24 UTC (permalink / raw) To: Tejun Heo Cc: Kapil Arya, ksummit-2010-discuss, linux-kernel, Gene Cooperman, hch On 11/04/2010 04:05 AM, Tejun Heo wrote: > Hello, > > On 11/04/2010 04:40 AM, Kapil Arya wrote: >> (Sorry for resending the message; the last message contained some html >> tags and was rejected by server) > > And please also don't top-post. Being the antisocial egomaniacs we > are, people on lkml prefer to dissect the messages we're replying to, > insert insulting comments right where they would be most effective and > remove the passages which can't yield effective insults. :-) > >> In our personal view, a key difference between in-kernel and userland >> approaches is the issue of security. The Linux C/R developers state >> the issue very well in their FAQ (question number 7): >>> https://ckpt.wiki.kernel.org/index.php/Faq : >>> 7. Can non-root users checkpoint/restart an application ? >>> >>> For now, only users with CAP_SYSADMIN privileges can C/R an >>> application. This is to ensure that the checkpoint image has not been >>> tampered with and will be treated like a loadable kernel-module. > > That's an interesting point but I don't think it's a dealbreaker. > Kernel CR is gonna require userland agent anyway and access control > can be done there. Indeed, this is a restriction on the new eclone() syscall, and can be addressed with proper userspace tools (including crypo-sign the checkpoint image). There core of the c/r code allows a user to restore anything within the user's privilege level. > Being able to snapshot w/o root privieldge > definitely is a plust but it's not like CR is gonna be deployed on > majority of desktops and servers (if so, let's talk about it then). Why not ? 
it has zero overhead when not in use, and a reasonable code footprint (which can be reduced by modularizing some of it, but that's outside the point). >> Strategies like these are easily handled in userspace. We suspect >> that while one may begin with a pure kernel approach, eventually, >> one will still want to add a userland component to achieve this kind >> of flexibility, just as BLCR has already done. > > Yeap, agreed. There gotta be user agents which can monitor and > manipulate userland states. It's a fundamentally nasty job, that of Are we talking about distributed checkpoint or "standalone" ? DMTCP relies on user agents to allow distributed/remote execution in a manner mostly transparent to the application. Many distributed systems don't require (and do not use) user agents. Consider a multi-tier system with web server, sql server and some applications server. These are not suitable to DMTCP's mode or work. (This is not to say DMTCP isn't useful - it's a clever piece of software with specific goals and more geared towards HPC needs). Now regarding "standalone" c/r, if you want to save/restore single or a subset of processes of a system without the rest of it, then you will always need user agents, regardless of userspace/kernel method. Likewise, their work on those tools will be as useful independently of which c/r 'engine' it uses. When you include all the relevant processes (e.g. an entire VNC session, a web server, HPC and batch jobs), you generally don't need the user agents. The checkpoint is self-contained, and linux-cr can provide you that guarantee at checkpoint time. > collecting and applying application-specific workarounds. I've only > glanced the dmtcp paper so my understanding is pretty superficial. > With that in mind, can you please answer some of my curiosities? > > * As Oren pointed out in another message, there are somethings which > could seem a bit too visible to the target application. 
Like the > manager thread (is it visible to the application or is it hidden by > the libc wrapper?) and reserved signal. Also, while it's true that > all programs should be ready to handle -EINTR failure from system > calls, it's something which is very difficult to verify and test and > could lead to once-in-a-blue-moon head scratchy kind of failures. If there is a will, there is (almost always) a way ;) What MTCP does, IIUC, is wrap around the applications with a complete pid-namespace (and more) in userspace. There are/were also commercial products that do that. It's a tremendous effort and I'm impressed by their (MTCP) work so far. It is important to understand that it has a price tag: performance and complexity. It's usually useful for HPC needs, but unsuitable for the generic server/VPS space. > > I think most of those issues can be tackled with minor narrow-scoped > changes to the kernel. Do you guys have things on mind which the > kernel can do to make these things more transparent or safer? Hmmm... the kernel already does much of it - for instance, we have neat pid-namespace infrastructure; does it make sense to go into the trouble of adding interfaces to provide for pid-virtalization in userspace ? we should be past that ... Moreover, your objection was based on the apparent complexity of a badly presented aggregate diff (and I disagree: most of that are simple refactoring and cleanups). However, that very set of "narrow-scoped changes" to the kernel that you suggest, will take life in the form of kernel patches that will do more than these and will achieve less. > * The feats dmtcp achieves with its set of workarounds are impressive > but at the same time look quite hairy. Christoph said that having a > standard userland C-R implementation would be quite useful and IMHO > it would be helpful in that direction if the implementation is > modularized enough so that the core functionality and the set of > workarounds can be easily separated. 
Is it already so? From what I understand, the 'wrapper' functionality to support distributed operation is said to be well modularized from the actual c/r engine - which will allow it to use better c/r engines; and coincidentally, I have one in mind... ;) Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-02 21:35 ` [Ksummit-2010-discuss] checkpoint-restart: naked patch Tejun Heo 2010-11-02 21:47 ` Christoph Hellwig 2010-11-04 3:40 ` Kapil Arya @ 2010-11-04 4:03 ` Oren Laadan 2010-11-04 9:43 ` Tejun Heo 2010-11-05 3:55 ` Kapil Arya 2 siblings, 2 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-04 4:03 UTC (permalink / raw) To: Tejun Heo; +Cc: ksummit-2010-discuss, linux-kernel Hi, (disclaimer: you may want to grab a cup of your favorite coffee) On 11/02/2010 05:35 PM, Tejun Heo wrote: > (cc'ing lkml too) > Hello, > > On 11/02/2010 08:30 PM, Oren Laadan wrote: >> Following the discussion yesterday, here is a linux-cr diff that >> that is limited to changes to existing code. >> >> The diff doesn't include the eclone() patches. I also tried to strip >> off the new c/r code (either code in new files, or new code within >> #ifdef CONFIG_CHECKPOINT in existing files). >> >> I left a few such snippets in, e.g. c/r syscalls templates and >> declaration of c/r specific methods in, e.g. file_operations. >> >> The remaining changes in this patch include new freezer state >> ("CHECKPOINTING"), mostly refactoring of exsiting code, and a bit >> of new helpers. >> >> Disclaimer: don't try to compile (or apply) - this is only intended >> to give a ballpark of how the c/r patches change existing code. > > The patch size itself isn't too big but I still think it's one scary > patch mostly because the breadth of the code checkpointing needs to > modify and I suspect that probably is the biggest concern regarding > checkpoint-restart from implementation point of view. I agree, it *looks* scary. But that's mostly because it's a dumb diff out of context, rather than a standard "patch" as set of logical incremental changes. So posting this diff is probably the worst way to present the impact on existing code. It merely gives a ballpark of that. 
However, please keep in mind that this diff is really an aggregate of multiple unrelated, structured, small changes, including: - cleanups (e.g. x86 ptrace) - refactoring (e.g. ipc, eventpoll, user-ns) - new features/enhancements (e.g. splice, freezer, mm) I'm confident that each of these will make more sense when presented in the proper context. > > FWIW, I'm not quite convinced checkpoint-restart can be something In the ksummit presentation I gave an extensive list of real use-cases (existing and future). The slides are here: http://www.cs.columbia.edu/~orenl/talks/ksummit-2010.pdf For more technical details there is also the OLS-2010 paper here: http://www.cs.columbia.edu/~orenl/papers/ols2010-linuxcr.pdf and the presentation slides from there are here: http://www.cs.columbia.edu/~orenl/talks/ols2010-linuxcr.pdf > which can be generally useful. In controlled environments where the > target application behavior can be relatively well defined and > contained (including actions necessary to rollback in case something > goes bonkers), it would work and can be quite useful, but I'm afraid > the states which need to be saved and restored aren't defined well > enough to be generally applicable. Not only is it a difficult > problem, it actually is impossible to define common set of states to > be saved and restored - it depends on each application. I'm unsure which states you have in mind that will not be well defined. It is a difficult problem, and C/R has limitations, but I think we've got it pretty right this time :) * we save and restore *all* *execution* state of the applications (except for well-defined unsupported features; hardware devices are one such example). * we don't save FS state (use filesystem snapshots for that); but we do save runtime FS state (e.g. open files, etc). * we don't save state of peers (applications/systems) over network; but we do save network connections for proper live-migration. 
(Of course, there is a supporting userspace ecosystem, like utilities to do the checkpoint/restart, to freeze/thaw the application, to snapshot the filesystem etc). So unless the application uses an unsupported resource, it will be possible to checkpoint that application and restart it successfully. > > As such, I have difficult time believing it can be something generally > useful. IOW, I think talking about its usage in complex environments > like common desktops is mostly handwaving. What about X sessions, > network connections, states established in other applications via dbus > or whatnot? Which files need to be snapshotted together? What about > shared mmaps? These questions are not difficult to answer in generic > way, they are impossible. I have a cool demo (and I gave one today!) that shows how I run one desktop session and restart an older desktop session that then runs in parallel to my existing session, in another window, so I have both the current and the older session running side by side. (It's a version of C/R as a kernel module for an older kernel; we're not yet there with linux-cr.) Hand-waving ? maybe, but a pretty convincing one ;) To be clear, C/R is more generic than saving/restoring a single process: rather, it works on process hierarchies (and complete containers). So a checkpoint will typically capture the state of e.g. a VNC server (X session) and the applications (xterm, win-manager etc), and the dbus daemon, and all their open files, and sockets etc. (BTW, if you were to live-migrate that X session to another host, we'd save the TCP state as well; otherwise, we save the sockets in CLOSED state - analogous to what happens when your applications run again after the laptop was suspended for a long time). Likewise, in my demo, files are not snapshotted independently. Instead, the entire file system is snapshotted at once. Bottom line - it's simpler than it sounds. 
Let's compare this to the save/restore of an entire VM: in a VM you
bundle all the state inside as a single big package (and this makes
life much easier). Likewise, in C/R, we bundle all the necessary
processes, e.g. an entire container, in a single big package - we pack
all the data necessary to make the checkpoint self-sufficient.

>
> There is a very distinctive difference between system wide
> suspend/hibernation and process checkpointing. Most programs are
> already written with the conditions in mind which can be caused by
> system level suspend/hibernation. Most programs don't expect to be
> scheduled and run in any definite amount of time. There usually
> are provisions for loss or failure of resources which are out of the
> local system. There are corner cases which are affected and those
> programs contain code to respond to suspend/hibernation. Please note
> that this is about userland application behavior but not
> implementation detail in the kernel. It is a much more fundamental
> property.

Exactly. This means that the same applications would not be upset
after they are checkpointed/restarted, for the exact same reason -
they know how to "recover" from that. For instance, firefox will
re-establish a network connection to the web server.

C/R is as *transparent* as suspend/hibernation. Applications will
normally not be able to tell the difference between just having
experienced a suspend/hibernation or a checkpoint/restart.

> So, although checkpoint-restart can be very useful for certain
> circumstances, I don't believe there can be a general implementation.
> It inevitably needs to put somewhat strict restrictions on what the
> applications being checkpointed are allowed to do. And after my

Let me try to rephrase: there are restrictions on what applications
can do if they are to be successfully checkpointed. Examples:
* tasks that use hardware devices (e.g. sound card),
* tasks that use unsupported sockets (e.g.
  netlink),
* tasks that use yet-unsupported features (e.g. ptraced tasks)

That said, I'm quite confident that the set of features we support
(now or within easy reach) already covers a wide range of real
applications and use-cases.

> train of thought reaches there, I fail to see what the advantages of
> in-kernel implementation would be compared to something like the
> following.
>
> http://dmtcp.sourceforge.net/
>
> Sure, in-kernel implementation would be able to fake it better, but I
> don't think it's anything major. The coverage would be slightly
> better but breaking the illusion wouldn't take much. Just push it a
> bit further and it will break all the same. In addition, to be

I beg to differ.

DMTCP is indeed a very cool project. It's based on MTCP, a userspace
C/R tool, and as such, is restricted like all userspace
implementations. That is not to say that it isn't useful, but it is
limited in what it can do.

It is not my intention to bash their great work, but it's important
to understand its limitations, so just a few examples:

* Transparency: their papers say that it's required to link against
  their library, or modify the binary; they overload some signals (so
  the application can't use them)

* Completeness: many real resources are not supported, e.g. eventpoll,
  ipc, pending signals, etc.

* Complexity: they technically implement a virtual pid-namespace in
  userspace by intercepting calls to clone(). I wonder if they consider
  e.g. pid's saved on file owners or in AF_UNIX creds ? I'll just say
  it's nearly impossible with their 20K lines of code - I know because
  I did it in a kernel module ...

* Efficiency: from userspace it can't tell which mapped pages are dirty
  and which aren't, not to mention doing incremental checkpoints.

* Usefulness: can they live-migrate mysql server between two hosts
  prior to a kernel upgrade ? can they checkpoint stopped processes
  which cannot cooperate ? can they checkpoint/restart postgresql ?
In contrast, the kernel C/R is:

* much more complete and feature-rich,
* entirely transparent to applications (does not need their cooperation,
  can even do debugged tasks)
* can be highly optimized and do incremental c/r
* can do live migration
* is easier to maintain in the long run (because you don't need to cheat
  applications by intercepting their kernel calls from userspace!)
* flexible to allow smart userspace to also be c/r aware, if they so wish
* can provide a guarantee that a checkpoint is self-contained and can
  be later restarted

In fact, DMTCP will be much more useful if it builds on linux-cr
as its checkpoint-restart engine ;)

> useful, it would need userland framework or set of workarounds which
> are aware of and can manipulate userland states anyway. For workloads

What user space "state" needs to be worked around and manipulated ?

If you are referring to the file system - then a snapshot is necessary
in either method, userspace or kernel. If other, then please elaborate.

> for which checkpointing would be most beneficial (HPC for example), I
> think something like the above would do just fine and it would make
> much more sense to add small features to make userland checkpointing
> work better than doing the whole thing in the kernel.

Actually, because of the huge optimization potential that exists only
in kernel based C/R, the HPC applications are likely to benefit
tremendously too from it. Think about things like incremental
checkpoint, pre-copy to minimize downtime (like live-migration),
using COW to defer disk IO until after the application can resume
execution, and more. None of these is possible with userspace C/R.

I know of several places that do not use C/R because they can't
stop their long running processes for longer than a few milliseconds.
I know how to solve their problems with linux-cr. I doubt if any
userspace mechanism can get there.
> I think in-kernel checkpointing is in awkward place in terms of
> tradeoff between its benefits and the added complexities to implement
> it. If you give up coverage slightly, userland checkpointing is
> there. If you need reliable coverage, proper virtualization isn't too
> far away. As such, FWIW, I fail to see enough justification for the
> added complexity. I'll be happy to be proven wrong tho. :-)

There is a huge gap between what you can (and want to) do with
checkpoint-restart in userspace and in kernel implementations.
Linux can profit from this feature along multiple axes, in terms
of the HPC market, VPS solutions, desktop mobility, and much more.

I think the added complexity is more than manageable. If you take
a look at the patch-set (http://www.linux-cr.org/git) you'll see
that most of the code is straightforward, just full of details,
and definitely tangential to the existing kernel code. The changes
seen in this "naked" diff make more sense when they appear in order,
in the context of that logic.

We have shown that the mission is within reach and C/R can be more
than a toy implementation. To reduce the complexity of *reviewing*,
it's time to post the patch-set in small pieces that one can digest ...

Thanks,

Oren.
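The "incremental checkpoint" idea mentioned above can be illustrated with a toy model. This is only a sketch under loud assumptions: memory is modelled as a dict of page number to bytes, and the set of dirty pages is simply handed in (in the thread, knowing that set is exactly what is argued to require kernel support). None of the names below come from linux-cr; they are hypothetical and chosen for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for the toy model


def full_checkpoint(memory):
    """Copy every page; `memory` maps page number -> bytes."""
    return dict(memory)


def incremental_checkpoint(memory, dirty_pages):
    """Copy only the pages reported dirty since the previous checkpoint.

    `dirty_pages` is assumed to come from some page-tracking facility;
    this sketch does not show how to obtain it.
    """
    return {pn: memory[pn] for pn in dirty_pages if pn in memory}


def restore(base, increments):
    """Rebuild memory by replaying increments, oldest first, over the base."""
    image = dict(base)
    for inc in increments:
        image.update(inc)
    return image


if __name__ == "__main__":
    memory = {pn: bytes(PAGE_SIZE) for pn in range(1024)}
    base = full_checkpoint(memory)

    # The application touches two pages after the full checkpoint.
    memory[3] = b"\x01" * PAGE_SIZE
    memory[700] = b"\x02" * PAGE_SIZE
    inc = incremental_checkpoint(memory, dirty_pages={3, 700})

    print("pages in full image:", len(base))
    print("pages in increment: ", len(inc))
    print("restore matches:    ", restore(base, [inc]) == memory)
```

The toy ignores pages that are unmapped between checkpoints and everything that is not memory (files, sockets, registers); it is only meant to show why the size of an increment tracks the dirty set rather than the total memory.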
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-04 4:03 ` Oren Laadan @ 2010-11-04 9:43 ` Tejun Heo 2010-11-04 12:48 ` Luck, Tony 2010-11-06 10:12 ` Matt Helsley 2010-11-05 3:55 ` Kapil Arya 1 sibling, 2 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-04 9:43 UTC (permalink / raw) To: Oren Laadan; +Cc: ksummit-2010-discuss, linux-kernel Hello, Oren. On 11/04/2010 05:03 AM, Oren Laadan wrote: > (disclaimer: you may want to grab a cup of your favorite coffee) Alright, going to get my morning cup of coffee now. :-) > On 11/02/2010 05:35 PM, Tejun Heo wrote: >> The patch size itself isn't too big but I still think it's one scary >> patch mostly because the breadth of the code checkpointing needs to >> modify and I suspect that probably is the biggest concern regarding >> checkpoint-restart from implementation point of view. > > I agree, it *looks* scary. But that's mostly because it's a dumb > diff out of context, rather than a standard "patch" as set of > logical incremental changes. So posting this diff is probably the > worst way to present the impact on existing code. It merely gives > a ballpark of that. > > However, please keep in mind that this diff is really an aggregate > of multiple unrelated, structured, small changes, including: > - cleanups (e.g. x86 ptrace) > - refactoring (e.g. ipc, eventpoll, user-ns) > - new features/enhancements (e,g. splice, freezer, mm) > > I'm confident that each of these will make more sense when presented > in the proper context. Yeah, could be so but I wasn't really referring to the scariness of the patch per-se but rather how many subsystems CR needs to interact with. >> FWIW, I'm not quite convinced checkpoint-restart can be something > > In the ksummit presentation I gave an extensive list of real > use-cases (existing and future). 
The slides are here: > http://www.cs.columbia.edu/~orenl/talks/ksummit-2010.pdf > For more technical details there is also the OLS-2010 paper here: > http://www.cs.columbia.edu/~orenl/papers/ols2010-linuxcr.pdf > presentation slide from there are here: > http://www.cs.columbia.edu/~orenl/talks/ols2010-linuxcr.pdf Alright, reading... > I'm unsure which states you have in mind that will not be well defined. > > It is a difficult problem, and C/R has limitations, but I think we've > got it pretty right this time :) > > * we save and restores *all* *execution* state of the applications > (except for well-defined unsupported features; hardware devices > are one such example). > > * we don't save FS state (use filesystem snapshots for that); but > we do save runtime FS state (e.g. open files, etc). > > * we don't save state of peers (applications/systems) over network; > but we do save network connections for proper live-migration. If you think only about target processes, yeah sure, you can cover most of the stuff but that's not the impossible part. What's not defined is interaction with the rest of the system and userland. Userland ecosystem is crazy complex. You simply cannot stop, say, banshee or even pidgin, let it mingle with the rest of the system and restore it later in any safe way. > (Of course, there is a supporting userspace ecosystem, like utilities > to do the checkpoint/restart, to freeze/thaw the application, to > snapshot the filesystem etc). > > So unless the applications uses unsupported resource - it will be > possible to checkpoint that application and restart successfully. I'm afraid I can't agree with that. You can store and restore the states which kernel is aware of but that's a very small fraction of the whole picture. >> As such, I have difficult time believing it can be something generally >> useful. IOW, I think talking about its usage in complex environments >> like common desktops is mostly handwaving. 
What about X sessions, >> network connections, states established in other applications via dbus >> or whatnot? Which files need to be snapshotted together? What about >> shared mmaps? These questions are not difficult to answer in generic >> way, they are impossible. > > I have a cool demo (and I gave one today!) that shows how I run one > desktop session and restart an older desktop session that then runs > in parallel to my existing session, in another windows -> so I have > both current and older session running side by side. (it's an version > of C/R as kernel module for older kernel, we're not yet there with > linux-cr). Hand-waving ? maybe, but a pretty convincing one ;) > > To be clear, C/R is more generic than save/restore a single process: > rather, it works on process hierarchies (and complete containers). > So a checkpoint will typically capture the state of e.g. a VNC server > (X session) and the applications (xterm, win-manager etc), and the > dbus daemon, and all their open files, and sockets etc. Sure, you can freeze whole tree of related processes and move them around, but if you think about it, it's an already broken scenario. For example, dbus (or rather agents listening to it) doesn't only carry states specific to the set of applications being snapshotted. It also carries whole bunch of system-wide states or states for other applications. As soon as the system goes on executing after checkpointing, the checkpointed image of dbus and its agents become inconsistent and useless. You can't restore it later. You don't know what happened to other parts of the system inbetween. And this problem doesn't stem from technical details of the implementation. It's fundamental. CR tries to snapshot subset of a big state machine and then use the snapshot later or elsewhere. It doesn't and can't have full visibility into how the subset of states have and are going to interact with the rest of the states. 
As soon as the whole state machine makes progress, there is no guarantee of consistency. Without explicit provisions for specific applications, it just can't work in generic manner. Can I move my banshee or gwibber to my next machine transparently with in-kernel CR or even restore it later? In many cases, even I (the user) can't define what the desired states are. > (BTW, if you were to live-migrate that X session to another host, > we'd save the TCP state as well; otherwise, we save the sockets in > CLOSED state - analogous to what happens when your applications run > again after the laptop was suspended for a long time). > > Likewise, in my demo, files are not snapshotted independently. Instead, > the entire file system is snapshotted at once. > > Bottom line - it's simpler than what it sounds. Let's compare this to > the save/restore of an entire VM: in VM you bundle all the state inside > as a single big package (and this makes life much easier). Likewise, in > C/R, we bundle all the necessary processes, e.g. an entire container, > in a single big package - we pack all the data necessary to make the > checkpoint self-sufficient. So, that's why it comes down to containers and namespaces. You need to preemptively put the target applications in separate boxes so that they don't have much to do with the rest of the system. So that the states aren't intermixed and can be safely snapshotted without worrying about the rest of the system. I'm afraid that's not general or transparent at all. It's extremely invasive to how a system is setup and used. It basically is poor man's virtualization or rather partitioning without hardware support and at this point I find it very difficult to justify the added complexity. Let's just make virtualization better. >> So, although checkpoint-restart can be very useful for certain >> circumstances, I don't believe there can be a general implementation. 
>> It inevitably needs to put somewhat strict restrictions on what the >> applications being checkpointed are allowed to do. And after my > > Let me try to rephrase: there are restrictions to what applications > do if they are to be successfully checkpointed. Examples: > * tasks that use hardware devices (e.g. sound card), > * tasks that use unsupported sockets (e.g. netlink), > * tasks that use yet-unsupported feature (e.g. ptraced tasks) > > That said, I'm quite confident that the set of features we support > (now or within easy reach) already cover a wide range of real > applications and use-cases. I think my points are clear now. I'm not really talking about kernel resources the hierarchy of checkpointed processes are using. I'm talking about interaction with the rest of the system and how that can't be solved in general manner. > In contrast, the kernel C/R is: > > * much more complete and feature-rich, > * entirely transparent to applications (does not need their cooperation, > can even do debugged tasks) > * can be highly optimized and do incremental c/r > * can do live migration > * is easier to maintain in the long run (because you don't need to cheat > applications by intercepting their kernel calls from userspace!) > * flexible to allow smart userspace to also be c/r aware, if they so wish > * can provide a guarantee that a checkpoint is self-contained and can > be later restarted > > In fact, DMTCP will be much more useful if it builds on linux-cr > as its chekcpoint-restart engine ;) Yeah, it would definitely be interesting to think about how userland CR can be improved with some kernel support. That said, I don't think the differences listed above are that large given the common use cases. >> useful, it would need userland framework or set of workarounds which >> are aware of and can manipulate userland states anyway. For workloads > > What user space "state" needs to be worked-around and manipulated ? 
> > If you are referring to the file system - then a snapshot is necessary > in either method, userspace or kernel. If other, then please elaborate. I think dmtcp paper lists some of them. The message Kapil wrote in this thread also talks about handling vim. They're inevitable if you want to checkpoint subset of processes from a live system. The only reason those haven't come up with in-kernel CR yet is because they are hidden behind containers and namespaces. >> for which checkpointing would be most beneficial (HPC for example), I >> think something like the above would do just fine and it would make >> much more sense to add small features to make userland checkpointing >> work better than doing the whole thing in the kernel. > > Actually, because of the huge optimization potential that exists only > in kernel based C/R, the HPC applications are likely to benefit > tremendously too from it. Think about things like incremental > checkpoint, pre-copy to minimize downtime (like live-migration), > using COW to defer disk IO until after the application can resume > execution, and more. None of these is possible with userspace C/R. > > I know of several places that do not use C/R because they can't > stop their long running processes for longer than a few milliseconds. > I know how to solve their problems with linux-cr. I doubt if any > userspace mechanism can get there. I'm sure there will be some benefits to in-kernel implementation but the added complexity is crazy in comparison. I don't think it would be wise to include this invasive amount of code for several places which can't CR because they can't afford a few millisecs. >> I think in-kernel checkpointing is in awkward place in terms of >> tradeoff between its benefits and the added complexities to implement >> it. If you give up coverage slightly, userland checkpointing is >> there. If you need reliable coverage, proper virtualization isn't too >> far away. 
As such, FWIW, I fail to see enough justification for the >> added complexity. I'll be happy to be proven wrong tho. :-) > > There is a huge gap between what you can (and want) to do with > checkpoint-restart between userspace and kernel implementations. > Linux can profit from this feature along multiple axes, in terms > of the HPC market, VPS solutions, desktop mobility, and much more. > > I think the added complexity is more than manageable. If you take > a look at the patch-set (http://www.linux-cr.org/git) you'll see > for that most of the code is straightforward, just full of details, > and definitely tangent to the existing kernel code. The changes > seen in this "naked" diff make more sense when they appear orderly > in the context of that logic. > > We have shown that the mission is at reach and C/R can be more than > a toy implementation. To reduce the complexity of *reviwing*, it's > time to post the patch-set in small pieces that one can digest ... I'm sorry to be in this position but the trade off just seems way off. As I wrote earlier, the transparent part of in-kernel CR basically boils down to implementing pseudo virtualization without hardware support and given the not-too-glorious history of that and the much higher focus on proper virtualization these days, I just don't think it makes much sense. It's an extremely niche solution for niche use cases. If it were a self contained feature, sure, but it's reaching into a lot of core subsystems. Sorry, no. Thank you. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* RE: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-04 9:43 ` Tejun Heo @ 2010-11-04 12:48 ` Luck, Tony 2010-11-04 13:06 ` Tejun Heo 2010-11-06 10:12 ` Matt Helsley 1 sibling, 1 reply; 111+ messages in thread From: Luck, Tony @ 2010-11-04 12:48 UTC (permalink / raw) To: Tejun Heo, Oren Laadan Cc: ksummit-2010-discuss@lists.linux-foundation.org, linux-kernel@vger.kernel.org > If you think only about target processes, yeah sure, you can cover > most of the stuff but that's not the impossible part. What's not > defined is interaction with the rest of the system and userland. > Userland ecosystem is crazy complex. You simply cannot stop, say, > banshee or even pidgin, let it mingle with the rest of the system and > restore it later in any safe way. This is why I think it is important to define the limits of which kernel state features are covered (or going to be covered) by checkpoint/restart - and then list applications that are supported (Oren mentioned mysql server in this thread). It will always be easy for someone to point at some application like powertop and say "we can't migrate that, so checkpoint restart is therefore useless" ... this just is not true. This can be useful without having to be complete (as long as the limits are well defined). > I'm afraid I can't agree with that. You can store and restore the > states which kernel is aware of but that's a very small fraction of > the whole picture. See above - it may be enough to cover a significant number of useful cases. > Sure, you can freeze whole tree of related processes and move them > around, but if you think about it, it's an already broken scenario. > For example, dbus (or rather agents listening to it) doesn't only > carry states specific to the set of applications being snapshotted. > It also carries whole bunch of system-wide states or states for other > applications. 
> As soon as the system goes on executing after
> checkpointing, the checkpointed image of dbus and its agents become
> inconsistent and useless. You can't restore it later. You don't know
> what happened to other parts of the system inbetween.

Okay - so "dbus" is in the list of "can't do that now, and will never
be able to checkpoint/restore that class" - big deal. I'm getting
repetitive now, but one last time: just because this can't handle
every conceivable case doesn't make it useless.

> I'm afraid that's not general or transparent at all. It's extremely
> invasive to how a system is setup and used. It basically is poor
> man's virtualization or rather partitioning without hardware support
> and at this point I find it very difficult to justify the added
> complexity. Let's just make virtualization better.

I don't think that you'll ever make virtualization good enough
to make the HPC people happy.

>> I know of several places that do not use C/R because they can't
>> stop their long running processes for longer than a few milliseconds.
>> I know how to solve their problems with linux-cr. I doubt if any
>> userspace mechanism can get there.
>
> I'm sure there will be some benefits to in-kernel implementation but
> the added complexity is crazy in comparison. I don't think it would
> be wise to include this invasive amount of code for several places
> which can't CR because they can't afford a few millisecs.

The CR Kool-Aid hasn't gotten far enough into my system for me to
accept this claim. If these "can't stop for more than a few
milli-seconds" processes are HPC workloads, then I'm not seeing how
you can do much to help them. I think these applications are using
almost all of the RAM on the system, and most of the pages are
anonymous. Just how do you checkpoint several GB of dirty pages in
a few milli-seconds (when there is almost no free memory on the
system)?

If you have something else in mind, then please explain a little more.

-Tony
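The question above (several GB of dirty pages versus a budget of a few milliseconds) is essentially arithmetic, and the "pre-copy" technique named earlier in the thread can be sketched as a small simulation. This is a hedged toy model, not how linux-cr works: the function name, the constant dirty rate, and the constant copy bandwidth are all assumptions made up for illustration.

```python
def precopy_rounds(total_mb, dirty_rate_mb_s, copy_rate_mb_s,
                   max_downtime_ms, max_rounds=30):
    """Simulate iterative pre-copy under a (made-up) constant-rate model.

    Each round copies, while the task keeps running, whatever was dirtied
    during the previous round. The task is frozen only for the final pass.
    Returns (live_rounds, final_dirty_mb, downtime_ms); the caller checks
    whether downtime_ms actually fits the budget.
    """
    to_copy = float(total_mb)
    for rounds in range(1, max_rounds + 1):
        copy_time_s = to_copy / copy_rate_mb_s
        # Memory dirtied while this round was being copied (capped at RAM).
        dirtied = min(total_mb, dirty_rate_mb_s * copy_time_s)
        # Time the task would be frozen if we stopped and copied that now.
        downtime_ms = dirtied / copy_rate_mb_s * 1000.0
        if downtime_ms <= max_downtime_ms:
            break
        if dirtied >= to_copy:
            # Dirtying keeps up with copying: further rounds do not help.
            break
        to_copy = dirtied
    return rounds, dirtied, downtime_ms


if __name__ == "__main__":
    # Assumed numbers: 4 GB of memory, 5 ms downtime budget.
    for dirty_rate in (100, 2000):
        r, d, ms = precopy_rounds(4096, dirty_rate, 1000, 5)
        print(f"dirty rate {dirty_rate} MB/s: {r} live round(s), "
              f"{d:.3f} MB left, downtime {ms:.3f} ms")
```

Under this model the frozen time depends on how much is dirtied during the last live round, not on total memory: with the assumed 100 MB/s dirty rate and 1000 MB/s copy rate, three live rounds leave about 4 MB to copy while frozen. When the workload dirties memory as fast as it can be copied (the second case), the loop gives up after one round, which is the situation the question above is worried about.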
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-04 12:48 ` Luck, Tony @ 2010-11-04 13:06 ` Tejun Heo 0 siblings, 0 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-04 13:06 UTC (permalink / raw) To: Luck, Tony Cc: Oren Laadan, ksummit-2010-discuss@lists.linux-foundation.org, linux-kernel@vger.kernel.org Hello, On 11/04/2010 01:48 PM, Luck, Tony wrote: >> If you think only about target processes, yeah sure, you can cover >> most of the stuff but that's not the impossible part. What's not >> defined is interaction with the rest of the system and userland. >> Userland ecosystem is crazy complex. You simply cannot stop, say, >> banshee or even pidgin, let it mingle with the rest of the system and >> restore it later in any safe way. > > This is why I think it is important to define the limits of > which kernel state features are covered (or going to be > covered) by checkpoint/restart - and then list applications > that are supported (Oren mentioned mysql server in this thread). > It will always be easy for someone to point at some application > like powertop and say "we can't migrate that, so checkpoint > restart is therefore useless" ... this just is not true. This > can be useful without having to be complete (as long as the > limits are well defined). > >> I'm afraid I can't agree with that. You can store and restore the >> states which kernel is aware of but that's a very small fraction of >> the whole picture. > > See above - it may be enough to cover a significant number of > useful cases. I was arguing that it is far from being _generally_ useful or transparent. If you're saying that it is something useful for certain use cases and application, yeah, sure. I never argued against that. >> I'm afraid that's not general or transparent at all. It's extremely >> invasive to how a system is setup and used. 
>> It basically is poor
>> man's virtualization or rather partitioning without hardware support
>> and at this point I find it very difficult to justify the added
>> complexity. Let's just make virtualization better.
>
> I don't think that you'll ever make virtualization good enough
> to make the HPC people happy.

If you think about HPC, userland implementation is enough. In 99% of
cases, those programs just read and write data files and burn a lot of
CPU cycles. You don't need a lot of fancy stuff to do that. More
important things would be integrating with job management so that
snapshots and rollbacks can be automatically done.

I agree that CR would be very useful for certain use cases and
applications. I just can't see where the giant patchset fits between
userland implementation, which seems enough for the most common use
case of HPC, and virtualization, which is maturing fast.

Thanks.

-- 
tejun
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-04 9:43 ` Tejun Heo 2010-11-04 12:48 ` Luck, Tony @ 2010-11-06 10:12 ` Matt Helsley 2010-11-06 11:03 ` Tejun Heo 2010-11-07 22:59 ` Davide Libenzi 1 sibling, 2 replies; 111+ messages in thread From: Matt Helsley @ 2010-11-06 10:12 UTC (permalink / raw) To: Tejun Heo; +Cc: Oren Laadan, ksummit-2010-discuss, linux-kernel On Thu, Nov 04, 2010 at 10:43:15AM +0100, Tejun Heo wrote: <snip> > > I'm afraid that's not general or transparent at all. It's extremely > invasive to how a system is setup and used. It basically is poor > man's virtualization or rather partitioning without hardware support > and at this point I find it very difficult to justify the added > complexity. Let's just make virtualization better. <snip> > I'm sorry to be in this position but the trade off just seems way off. > As I wrote earlier, the transparent part of in-kernel CR basically > boils down to implementing pseudo virtualization without hardware > support and given the not-too-glorious history of that and the much > higher focus on proper virtualization these days, I just don't think > it makes much sense. It's an extremely niche solution for niche use If you think specialized hardware acceleration is necessary for containers then perhaps you have a poor understanding of what a container is. Chances are if you're running a container with namespaces configured then you're already paying the performance costs of running in a container. If you've compared the performance of that kernel to your virtualization hardware then you already know how they compare. For containers everything is native. You're not emulating instructions. You're not running most instructions and trapping some. You're not running whole other kernels, coordinating sharing of pages and cpu with those kernels, etc. You're not emulating devices, busses, interrupts, etc. 
And you're also not then circumventing every virtualization mechanism
you just added in order to provide decent performance.

I rather doubt you'll see a difference between "native" hardware and...
native hardware. And I expect you'll see much better performance in one
of your containers than you'll ever see in some hand-waved
hypothetically-improved virtualization that your response implored us to
work on instead.

Our checkpoint/restart patches do *NOT* implement containers. They
sometimes work with containers to make use of checkpoint/restart simple.
In fact, containers are the strategy we use to enable the "generic"
checkpoint/restart that you seem to think we lack. Everything else is
an optimization choice that we give userspace which virtualization
notably lacks.

Like above, I expect that your virtualization hardware will compare
unfavorably to kernel-based checkpoint/restart of containers. Imagine
checkpointing "ls" or "sleep 10" in a VM. Then imagine doing so for a
container. It takes way less time and way less disk for the container.

(It's also going to be easier to manage since you won't have to do lots
of special steps to get at the information in a container which is
shutdown or even one that's running. If "mycontainer" is running then
simply do:

	lxc-attach -n mycontainer /bin/bash

Alternately, you can go through all the effort you normally do for a
VM -- set up a serial console, set up getty, set up sshd, etc. I don't
care -- it's more complicated than the above command line.)

So please stop asserting that a purported lack of hardware support
is significant. Also please remember that we're not implementing
containers in this patch set -- they're already in.

Yes, our patches touch a wide variety of kernel code. You have just
failed to appreciate how "wide" the kernel ABI truly is. You can't
really count it by number of syscalls, number of pseudo-filesystems,
etc. There's also the intended behavior of those interfaces to consider.
Each piece of checkpoint/restart code is relatively self-contained. This
can be confirmed merely by looking at many of the patches we've already
posted, each enabling checkpoint/restart of a particular feature.

Until you've tried to implement checkpoint/restart for an interface or
until you've bothered to review a patch for one of them (my favorite one
is eventfd: http://www.mail-archive.com/devel@openvz.org/msg21565.html )
please don't tell us it's too complex. Then compare that with your
proposed ghastly stack of userspace cards -- ptrace (really more like
strace) + LD_PRELOAD + a daemon...

Incidentally, 20k lines of code is less than many pieces of the kernel.
It's less than many:

Filesystems (I've selected ones designed for rotating media or networks usually..)
	ext4, nfs, ocfs2, xfs, reiserfs, ntfs, gfs2, jfs, cifs, ubifs, nilfs2, btrfs

Non-filesystem file-system support code:
	nfsd, nls

It's less than one of the simpler DRM graphics drivers -- i915:
	$ cd drivers/gpu/drm/i915
	$ wc -l *.[ch]
	...
	41481 total

It's less than any one of the lpfc, bfa, aic7xxx, qla2xxx, and mpt2sas
drivers I see under scsi. Perhaps a fairer comparison would be to
compare a single driver to a single checkpointable kernel interface, but
that comparison skews even more in our favor.

Yes, when you *add it all up* it's more than half the size of the kernel/
directory. Bear in mind that the portions we add to kernel/checkpoint though
are only 4603 lines long -- about the same size as many kernel/*.c files.
The rest is for each kernel interface that adds/manipulates state we need to
be able to checkpoint. Or arch code.. etc.

So please don't base your assessment of our code on your apparently
flawed notion of containers nor on the summary line of a diffstat
you saw.

Cheers,
	-Matt Helsley
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-06 10:12 ` Matt Helsley @ 2010-11-06 11:03 ` Tejun Heo 2010-11-07 22:59 ` Davide Libenzi 1 sibling, 0 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-06 11:03 UTC (permalink / raw) To: Matt Helsley; +Cc: Oren Laadan, ksummit-2010-discuss, linux-kernel Hello, On 11/06/2010 11:12 AM, Matt Helsley wrote: > If you think specialized hardware acceleration is necessary for > containers then perhaps you have a poor understanding of what a container > is. Chances are if you're running a container with namespaces configured > then you're already paying the performance costs of running in a > container. If you've compared the performance of that kernel to your > virtualization hardware then you already know how they compare. I was talking about virtualization when referring to hardware support. > So please stop asserting that a purported lack of hardware support > is significant. Also please remember that we're not implementing containers > in this patch set -- they're already in. Sure, that was my point. So, let's drop the handwaving about being transparent. > Incidentally, 20k lines of code is less than many pieces of the kernel. > It's less than many: > > Filesystems (I've selected ones designed for rotating media or networks usually..) > ext4, nfs, ocfs2, xfs, reiserfs, ntfs, gfs2, jfs, cifs, ubifs, nilfs2, btrfs > > Non-filesystem file-system support code: > nfsd, nls > > It's less than one of the simpler DRM graphics drivers -- i915: > $ cd drivers/gpu/drm/i915 > $ wc -l *.[ch] > ... > 41481 total > > It's less than any one of the lpfc, bfa, aic7xxx, qla2xxx, and mpt2sas > drivers I see under scsi. Perhaps a more fair comparison might be to compare > a single driver to a single checkpointable kernel interface but it's > a more-fair comparison that skews even more in our favor. Yeah, and imagine what people would say if ext4, or heaven forbid, aic7xxx code was scattered all over the kernel. 
> Yes, when you *add it all up* it's more than half the size of the kernel/ > directory. Bear in mind that the portions we add to kernel/checkpoint though > are only 4603 lines long -- about the same size as many kernel/*.c files. > The rest is for each kernel interface that adds/manipulates state we need to > be able to checkpoint. Or arch code.. etc. > > So please don't base your assessment of our code on your apparently > flawed notion of containers nor on the summary line of a diffstat > you saw. I don't believe my notion of containers was or is flawed and already said that the diffstat per-se didn't look too bad. With enough benefits, I wouldn't be opposed against the rather invasive changes. It's just that the whole thing is conceived backwards and there are already working alternatives which may be somewhat messy now but nevertheless achieve about the same effect without the craziness of serializing in-kernel data structures which are already mostly visible to userland to begin with. Thanks. -- tejun ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-06 10:12 ` Matt Helsley 2010-11-06 11:03 ` Tejun Heo @ 2010-11-07 22:59 ` Davide Libenzi 2010-11-08 2:32 ` david 1 sibling, 1 reply; 111+ messages in thread From: Davide Libenzi @ 2010-11-07 22:59 UTC (permalink / raw) To: Matt Helsley Cc: Tejun Heo, Oren Laadan, ksummit-2010-discuss, Linux Kernel Mailing List On Sat, 6 Nov 2010, Matt Helsley wrote: > Yes, our patches touch a wide variety of kernel code. You have just failed > to appreciate how "wide" the kernel ABI truly is. You can't really count > it by number of syscalls, number of pseudo-filesystems, etc. There's > also the intended behavior of those interfaces to consider. Each piece > of checkpoint/restart code is relatively self-contained. This can be > confirmed merely by looking at many of the patches we've already posted > enabling checkpoint/restart of that feature. Until you've tried to > implement checkpoint/restart for an interface or until you've bothered > to review a patch for one of them (my favorite on is eventfd: > http://www.mail-archive.com/devel@openvz.org/msg21565.html ) please > don't tell us it's too complex. Then compare that with your proposed > ghastly stack of userspace cards -- ptrace (really more like strace) + > LD_PRELOAD + a daemon... > > Incidentally, 20k lines of code is less than many pieces of the kernel. > It's less than many: > > Filesystems (I've selected ones designed for rotating media or networks usually..) > ext4, nfs, ocfs2, xfs, reiserfs, ntfs, gfs2, jfs, cifs, ubifs, nilfs2, btrfs > > Non-filesystem file-system support code: > nfsd, nls > > It's less than one of the simpler DRM graphics drivers -- i915: > $ cd drivers/gpu/drm/i915 > $ wc -l *.[ch] > ... > 41481 total > > It's less than any one of the lpfc, bfa, aic7xxx, qla2xxx, and mpt2sas > drivers I see under scsi. 
Perhaps a more fair comparison might be to compare > a single driver to a single checkpointable kernel interface but it's > a more-fair comparison that skews even more in our favor. Please, do not compare things like single file systems, drivers, or otherwise fairly isolated components, with this "thing". This thing touches a freaky-large number of subsystems, effectively adding glue between them, which might end up causing problems (and/or restricting design choices) in the future. The naked patch looks like just a sugar coating to me, which left out 300+ lines of extra logic in epoll alone. This is one of the widest, deepest, most intrusive patches I have seen in a while, whose inclusion would require a little bit more than handwaving and continuous re-posting IMO. - Davide ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-07 22:59 ` Davide Libenzi @ 2010-11-08 2:32 ` david 2010-11-18 20:41 ` Oren Laadan 0 siblings, 1 reply; 111+ messages in thread From: david @ 2010-11-08 2:32 UTC (permalink / raw) To: Davide Libenzi Cc: Matt Helsley, Tejun Heo, Oren Laadan, ksummit-2010-discuss, Linux Kernel Mailing List On Sun, 7 Nov 2010, Davide Libenzi wrote: > Please, do not compare things like single file systems, drivers, or > otherwise fairly isolated components, with this "thing". > This thing touches a freaky-large number of subsystems, effectively > adding a glueage between them, which can might end up causing problems > (and/or restrict design choices) in the future. I've got a question about the ABI that would be created. I see two possible areas that could be considered an ABI: 1. Control of the C/R process. This is very clearly a userspace ABI, to be figured out and locked down like any other ABI. 2. The details of how things are stored and added back into a system. This is not as clear. At one extreme, this could be like the module interface (the checkpointed image is only guaranteed to work on a new system with a kernel compiled with the same config options as the system it was checkpointed from). At the other extreme, this could be something that allows you to checkpoint an image on 2.6.40 and restore it on 2.6.80. Or it could be something in between. I don't see any way that it is sane to make the C/R image definition and interface (#2) be an ABI that is guaranteed to never change without hurting future kernel development (exactly the type of thing that Davide is worried about above), but what sort of guarantee are people interested in? Is it enough to say that it must be the same kernel version compiled with the same options? 
(Or at least the same options for some list of things that matter; most device drivers probably would not matter, for example.) Or would you need compatibility across all compile options for a kernel release? Would you require compatibility between 2.6.x.y and 2.6.x.z? Would you require compatibility between 2.6.x and 2.6.x+n (for some value of n)? Is this something that could go in with the weakest guarantee initially, and then, as everyone becomes more comfortable with it, start extending the guarantee (and, as needed, adding code to the kernel to maintain compatibility with old images)? Would you require compatibility between 2.6.x and 2.6.x-n? David Lang ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-08 2:32 ` david @ 2010-11-18 20:41 ` Oren Laadan 0 siblings, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-18 20:41 UTC (permalink / raw) To: david Cc: Davide Libenzi, Matt Helsley, Tejun Heo, ksummit-2010-discuss, Linux Kernel Mailing List On 11/07/2010 09:32 PM, david@lang.hm wrote: > On Sun, 7 Nov 2010, Davide Libenzi wrote: > >> Please, do not compare things like single file systems, drivers, or >> otherwise fairly isolated components, with this "thing". >> This thing touches a freaky-large number of subsystems, effectively >> adding a glueage between them, which can might end up causing problems >> (and/or restrict design choices) in the future. > > I've got a question about the ABI that would be created > > I see two possible areas that could be considered an ABI > > 1. control of the C/R process > > This is very clearly a userspace ABI, to be figured out and locked > down like any other ABI > > 2. the details of how things are stored and added back into a system > > This is not as clear. at one extreme, this could be like the module > interface, (the checkpointed image is only guaranteed to work on a new > system with a kernel compiled with the same config options as the system > it was checkpointed from). At the other extreme, this could be something > that allows you to ckeckpoint an image on 2.6.40 and restore it on > 2.6.80. Or it could be something in between. > > I don't see any way that it is sane to make the C/R image defiition and > interface (#2) be an ABI that is guaranteed to never change without > hurting future kernel development (exactly the type of things that > Davide is worried about above), but what sort of guarantee are people > interested in? Agreed. The guarantee should be to specific kernels, in a sense (see Matt's post in this thread 11/17). The image format is tied to "set of features supported" (which boils down to something like kernel version). 
The format is constructed in a modular way such that most new features can be added without breaking old format. For the rare cases that they do, conversion can be done in userspace in a straightforward manner. (All you need is convert from N to N+1). > > is it enough to sa that it must be the same kernel version compiled with > the same options? (or at least the same options for some list of things > that matter, most device drivers probably would not matter for example) > > or would you need compatibility across all compile options for a kernel > release? > > would you require compatibility between 2.6.x.y and 2.6.x.z? > > would you require compatibility between 2.6.x and 2.6.x+n (for some > value of n)? > > is this something that could go in with the weakest guarantee initially, > and then as everyone is more comfortable with it, start extending the > guarantee (and as-needed adding code to the kernel to maintain > compatibility with old images)? > > would you require compatibility between 2.6.x and 2.6.x-n? We don't "require" compatibility. The compatibility is defined per object (type) in the image format. New objects need not break compatibility. Changes to objects are very rare; and when they happen they "bump" the version. This can help avoid issues related to kernel configs/options. Restarting an image incompatible with a particular kernel will fail, adjustments should be done by userspace filtering. Thanks, Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
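The per-object versioning and "convert from N to N+1" idea in the message above can be illustrated with a small userspace sketch. This is only an illustration under stated assumptions: the record layout, the "file" object type, the field names, and the converter registry are all hypothetical and are not the linux-cr image format.

```python
# Hypothetical sketch: none of these names come from linux-cr.
# Each record in an image carries its own type and version; a userspace
# filter upgrades a record by chaining single-step N -> N+1 converters.

CONVERTERS = {}  # (object type, from_version) -> function producing from_version + 1


def converter(obj_type, from_version):
    """Decorator that registers a single-step converter."""
    def register(fn):
        CONVERTERS[(obj_type, from_version)] = fn
        return fn
    return register


@converter("file", 1)
def file_v1_to_v2(rec):
    # Assumed change: version 2 added an explicit "flags" field.
    out = dict(rec, version=2)
    out.setdefault("flags", 0)
    return out


@converter("file", 2)
def file_v2_to_v3(rec):
    # Assumed change: version 3 renamed "pos" to "offset".
    out = dict(rec, version=3)
    out["offset"] = out.pop("pos")
    return out


def upgrade(rec, target_version):
    """Apply N -> N+1 converters until rec reaches target_version."""
    if rec["version"] > target_version:
        raise ValueError("downgrade not supported in this sketch")
    while rec["version"] < target_version:
        key = (rec["type"], rec["version"])
        if key not in CONVERTERS:
            raise ValueError(f"no converter registered for {key}")
        rec = CONVERTERS[key](rec)
    return rec
```

With this shape, supporting a new format version only needs one new single-step converter; a restart on a kernel that does not understand a record's version would simply fail until the image has been filtered.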
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-04 4:03 ` Oren Laadan 2010-11-04 9:43 ` Tejun Heo @ 2010-11-05 3:55 ` Kapil Arya 2010-11-05 11:57 ` Luck, Tony ` (2 more replies) 1 sibling, 3 replies; 111+ messages in thread From: Kapil Arya @ 2010-11-05 3:55 UTC (permalink / raw) To: Oren Laadan Cc: Tejun Heo, ksummit-2010-discuss, linux-kernel, Gene Cooperman, Kapil Arya (Sorry for the length of this email, we are excited about being able to discuss technical details.) It is wonderful to have this exchange of techniques and visions. Oren, we are guessing that you are at Columbia. If so, we would love to have you come up here and give a talk in Boston. Alternatively, if you prefer, we would be happy to go to Columbia and give a talk there. In comparing functionality, one recent bug we had to overcome was with screen with a hardstatus line and a scroll region for the terminal. We eventually solved it in a subtle way by sending SIGWINCH, and then lying to screen about changing the kernel window size, and then sending screen another SIGWINCH while telling it the true window size. We were pleased to see that Linux C/R also supports screen and we are curious how it handles this issue of restoring the scroll region in the X11 terminal window. Thanks. Oren noted that sometimes it's important to stop the process only for a few milliseconds while one checkpoints. In DMTCP, we do that by configuring with --enable-forked-checkpointing. This causes us to fork a child process taking advantage of copy-on-write and then checkpoint the memory pages of the child while the parent continues to execute. > So a checkpoint will typically capture the state of e.g. a VNC server (X > session) and the applications (xterm, win-manager etc), and the dbus daemon, > and all their open files, and sockets etc. This is a good example of distinct approaches when starting from Kernel C/R or user-space C/R. We currently checkpoint VNC servers in a way similar to Linux C/R. 
However, in the next few months, we want to directly checkpoint a single X-windows application without the X11-server. The approach is easily understood by analogy. Currently libc.so talks to the kernel. At checkpoint time, we interrogate the kernel state and then "break" the connection to the kernel and checkpoint. Similarly, libX11.so (or libX11-xcb.so) talks to the X11-server. At checkpoint time, we will interrogate the state of the X11-server and then break the connection and checkpoint. > DMTCP is indeed a very cool project. ... It is not my intention to bash > their great work, but it's important to understand its limitations, so just a > few examples: Thanks very much for bringing up these implementation questions. It's wonderful to have someone interested in the low level technology to talk to. We would like to share with you our current solutions and our plans for the future. We will also add some of our questions about Linux C/R inline. Thanks for the answers in advance. > required to link against their library, or modify the binary; We currently use LD_PRELOAD to transparently preload our library. The user doesn't see this. If the application is statically linked, then this doesn't work. Until now, we haven't seen user requests to support statically linked applications. If we do, there are other techniques to modify the call sites or entry points for libc routines within the user binary. > They overload some signals (so the application can't use them) By default, DMTCP uses SIGUSR2. At process startup, the user can specify: dmtcp_checkpoint --mtcp-checkpoint-signal <signum> a.out to change the DMTCP signal. As an additional point that we found interesting, libc has a similar policy of using several hardwired signals: #define SIGCANCEL __SIGRTMIN #define SIGTIMER SIGCANCEL #define SIGSETXID (__SIGRTMIN + 1) So there is a precedent for this approach. > Completeness: many real resources are not supported, e.g. eventpoll, ipc, > pending signals, etc. 
IPC and pending signals are supported. We know how to do eventpoll but haven't encountered a use case from our userbase and so haven't added it yet. > * Complexity: they technically implement a virtual pid-namespace in userspace > by intercepting calls to clone(). I wonder if they consider e.g. pid's saved > on file owners or in afunix creds ? I'll just say it's nearly impossible with > their 20K lines of code - I know because I did it in a kernel module ... We do wrap clone and create a table from original PID/TID to current PID/TID just as you say. To our knowledge, we have wrappers for all system calls involving a PID/TID except fcntl. We are guessing that either Linux C/R also keeps a translation table or else restores the original PID/TID. Which do you do? In the latter case what do you do if a PID/TID is already used by another process/thread? > * Efficiency: from userspace it can't tell which mapped pages are dirty and > which aren't, not to mention doing incremental checkpoints. One of the DMTCP team, Artem Polyakov, has developed incremental checkpointing for DMTCP and for BLCR. We are still evaluating it. It's at: http://sourceforge.net/projects/hbict > * Usefulness: can they live-migrate mysql server between two hosts prior to a > kernel upgrade ? We have not experimented with live-migration. Live-migration in user space is an interesting topic but will take us into deep discussion outside of the current scope. Of course VMware and others already do it. We would enjoy talking further with you offline. It's certainly a cool use case. > can they checkpoint stopped processes which cannot cooperate ? We haven't had a user request for checkpointing stopped processes so far. However, one can use PTRACE (similar to doing gdb attach on a stopped process) to achieve this. > can they checkpoint/restart postgresql ? We don't know. We have succeeded on MySQL. We never tried postgresql. What are the special issues there? > In contrast, the kernel C/R is: > ... 
> * entirely transparent to applications (does not need their cooperation, can > even do debugged tasks) We are not sure what you are referring to by cooperation and debugged tasks. If it helps, we can say that DMTCP can checkpoint an entire gdb session or just the process being debugged by gdb, according to the requirements. Our support for PTRACE is in the unstable branch. > * is easier to maintain in the long run (because you don't need to cheat > applications by intercepting their kernel calls from userspace!) We have to agree to disagree on this one. We see almost no new bugs or issues with kernel upgrades. The most recent case was the need to add the wrapper for pipe2 (2.6.27) and accept4 (2.6.28) and each wrapper was about 20 new lines of code. > * flexible to allow smart userspace to also be c/r aware, if they so wish DMTCP also has a dmtcpaware facility by which applications can request checkpoints for themselves or other processes. It also supports user hook functions for checkpoint, resume, and restart. > * can provide a guarantee that a checkpoint is self-contained and can be > later restarted Could you tell us more about what you mean by guarantee and self-contained? > In fact, DMTCP will be much more useful if it builds on linux-cr as its > chekcpoint-restart engine ;) Your suggestion is an interesting one. One of our team members, Jason Ansel, has made the same suggestion with respect to BLCR. This would be a great experiment to try and we would be glad to work with you to get an initial version of DMTCP on top of Linux C/R. DMTCP has a higher layer dmtcphijack.so and a lower layer libmtcp.so (MTCP) which can be replaced by a modified single process checkpointer with hooks for dmtcphijack.so. Unfortunately, our group doesn't have the resources to maintain and develop two branches: DMTCP/MTCP and DMTCP/Linux C/R. Nevertheless, if you were interested in going forward on the DMTCP/Linux C/R branch, we could share code and ideas. 
> Actually, because of the huge optimization potential that exists only in > kernel based C/R, the HPC applications are likely to benefit tremendously too > from it. Think about things like incremental checkpoint, pre-copy to minimize > downtime (like live-migration), using COW to defer disk IO until after the > application can resume execution, and more. None of these is possible with > userspace C/R. BLCR is a kernel-based C/R package, and appears to be the current standard for HPC. Are you saying that BLCR should be replaced by Linux C/R, if so, why? Concerning user space C/R, please see our comments above. > I know of several places that do not use C/R because they can't stop their > long running processes for longer than a few milliseconds. I know how to > solve their problems with linux-cr. I doubt if any userspace mechanism can > get there. DMTCP supports forked checkpointing as a configure option. A child is forked using COW and it writes its memory to disk at leisure. Thanks, Gene Cooperman and Kapil Arya ^ permalink raw reply [flat|nested] 111+ messages in thread
* RE: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-05 3:55 ` Kapil Arya @ 2010-11-05 11:57 ` Luck, Tony 2010-11-05 17:17 ` Gene Cooperman 2010-11-05 17:31 ` Sukadev Bhattiprolu 2010-11-06 21:05 ` Oren Laadan 2 siblings, 1 reply; 111+ messages in thread From: Luck, Tony @ 2010-11-05 11:57 UTC (permalink / raw) To: Kapil Arya, Oren Laadan Cc: ksummit-2010-discuss@lists.linux-foundation.org, Gene Cooperman, linux-kernel@vger.kernel.org > Oren noted that sometimes it's important to stop the process only > for a few milliseconds while one checkpoints. In DMTCP, we do that > by configuring with --enable-forked-checkpointing. This causes us > to fork a child process taking advantage of copy-on-write and then > checkpoint the memory pages of the child while the parent continues > to execute. Interesting ... but while the process is only stopped for the duration of the fork, it may be taking COW faults on almost every page it touches. I think this will not work well for large HPC applications that allocate most of physical memory as anonymous pages for the application. It may even result in an OOM kill if you don't complete the checkpoint of the child and have it exit in a timely manner. -Tony ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-05 11:57 ` Luck, Tony @ 2010-11-05 17:17 ` Gene Cooperman 2010-11-06 1:16 ` Matt Helsley 2010-11-06 21:00 ` Oren Laadan 0 siblings, 2 replies; 111+ messages in thread From: Gene Cooperman @ 2010-11-05 17:17 UTC (permalink / raw) To: Luck, Tony Cc: Kapil Arya, Oren Laadan, ksummit-2010-discuss@lists.linux-foundation.org, Gene Cooperman, linux-kernel@vger.kernel.org On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote: > > Oren noted that sometimes it's important to stop the process only > > for a few milliseconds while one checkpoints. In DMTCP, we do that > > by configuring with --enable-forked-checkpointing. This causes us > > to fork a child process taking advantage of copy-on-write and then > > checkpoint the memory pages of the child while the parent continues > > to execute. > > Interesting ... but while the process is only stopped for the duration > of the fork, it may be taking COW faults on almost every page it > touches. I think this will not work well for large HPC applications > that allocate most of physical memory as anonymous pages for the > application. It may even result in an OOM kill if you don't complete > the checkpoint of the child and have it exit in a timely manner. > > -Tony > I agree with you that forked checkpointing is probably not what you want in the middle of an HPC computation. But isn't that part of the nature of COW? Whether the COW is invoked within the kernel, or from outside the kernel via fork --- in either case, when you have mostly dirty pages, you will have to copy most of the pages. Do I understand your point correctly? Thanks, - Gene ^ permalink raw reply [flat|nested] 111+ messages in thread
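The forked-checkpointing idea debated in the messages above (fork a child, let copy-on-write preserve the snapshot, write it out while the parent keeps running) can be sketched in a few lines. This is a minimal illustration, not DMTCP's implementation: the function name, the use of pickle as the image format, and the temporary-file naming are all assumptions made for the example.

```python
import os
import pickle


def forked_checkpoint(state, path):
    """Fork; the child writes its fork-time view of `state` to `path`.

    Returns the child's pid so the caller can reap it later. The parent
    is paused only for the fork itself; pages it writes afterwards are
    copied on demand (the COW faults discussed above).
    """
    pid = os.fork()
    if pid == 0:
        # Child: memory is a copy-on-write snapshot taken at fork time.
        status = 1
        try:
            tmp = f"{path}.tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, path)  # publish the image atomically
            status = 0
        finally:
            os._exit(status)  # never fall back into the parent's code path
    return pid
```

A caller would invoke `forked_checkpoint(state, "ckpt.img")`, continue mutating `state`, and later call `os.waitpid(pid, 0)`. The concern raised above still applies: if the parent dirties most of its pages before the child exits, memory use approaches twice the working set. In CPython, reference counting also dirties pages on reads, so far less memory stays shared than in an equivalent C program.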
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-05 17:17 ` Gene Cooperman @ 2010-11-06 1:16 ` Matt Helsley 2010-11-06 4:06 ` Oren Laadan 2010-11-06 21:00 ` Oren Laadan 1 sibling, 1 reply; 111+ messages in thread From: Matt Helsley @ 2010-11-06 1:16 UTC (permalink / raw) To: Gene Cooperman Cc: Luck, Tony, Kapil Arya, Oren Laadan, ksummit-2010-discuss@lists.linux-foundation.org, linux-kernel@vger.kernel.org On Fri, Nov 05, 2010 at 01:17:03PM -0400, Gene Cooperman wrote: > On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote: > > > Oren noted that sometimes it's important to stop the process only > > > for a few milliseconds while one checkpoints. In DMTCP, we do that > > > by configuring with --enable-forked-checkpointing. This causes us > > > to fork a child process taking advantage of copy-on-write and then > > > checkpoint the memory pages of the child while the parent continues > > > to execute. > > > > Interesting ... but while the process is only stopped for the duration > > of the fork, it may be taking COW faults on almost every page it > > touches. I think this will not work well for large HPC applications > > that allocate most of physical memory as anonymous pages for the > > application. It may even result in an OOM kill if you don't complete > > the checkpoint of the child and have it exit in a timely manner. > > > > -Tony > > > > I agree with you that forked checkpointing is probably not what you > want in the middle of an HPC computation. But isn't that part of > the nature of COW? Whether the COW is invoked within the kernel, > or from outside the kernel via fork --- in either case, when you have > mostly dirty pages, you will have to copy most of the pages. The current linux-cr approach to handling [dirty] pages doesn't use COW. The tasks are frozen using the cgroup freezer and thus unable to modify the pages. So we don't have to mess with page tables nor do we pay any extra overhead for page faults. 
If we ever implement thawed checkpointing -- checkpointing while the task isn't frozen -- then we'd probably use COW and see the same faults. The difference then would be that in-kernel we wouldn't have one extra task per mm being checkpointed. Cheers, -Matt Helsley ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-06 1:16 ` Matt Helsley @ 2010-11-06 4:06 ` Oren Laadan 2010-11-06 5:18 ` Matt Helsley 0 siblings, 1 reply; 111+ messages in thread From: Oren Laadan @ 2010-11-06 4:06 UTC (permalink / raw) To: Matt Helsley Cc: Gene Cooperman, Luck, Tony, Kapil Arya, ksummit-2010-discuss@lists.linux-foundation.org, linux-kernel@vger.kernel.org On 11/05/2010 09:16 PM, Matt Helsley wrote: > On Fri, Nov 05, 2010 at 01:17:03PM -0400, Gene Cooperman wrote: >> On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote: >>>> Oren noted that sometimes it's important to stop the process only >>>> for a few milliseconds while one checkpoints. In DMTCP, we do that >>>> by configuring with --enable-forked-checkpointing. This causes us >>>> to fork a child process taking advantage of copy-on-write and then >>>> checkpoint the memory pages of the child while the parent continues >>>> to execute. >>> >>> Interesting ... but while the process is only stopped for the duration >>> of the fork, it may be taking COW faults on almost every page it >>> touches. I think this will not work well for large HPC applications >>> that allocate most of physical memory as anonymous pages for the >>> application. It may even result in an OOM kill if you don't complete >>> the checkpoint of the child and have it exit in a timely manner. >>> >>> -Tony >>> >> >> I agree with you that forked checkpointing is probably not what you >> want in the middle of an HPC computation. But isn't that part of >> the nature of COW? Whether the COW is invoked within the kernel, >> or from outside the kernel via fork --- in either case, when you have >> mostly dirty pages, you will have to copy most of the pages. > > The current linux-cr approach to handling [dirty] pages doesn't use COW. > The tasks are frozen using the cgroup freezer and thus unable to modify > the pages. So we don't have to mess with page tables nor do we pay > any extra overhead for page faults. 
The current linux-cr patchset leaves out any optimizations for simplicity of reviewing - first get it working and reviewed. We experimented with optimizations in previous systems. > > If we ever implement thawed checkpointing -- checkpointing while > the task isn't frozen -- then we'd probably use COW and see > the same faults. The difference then would be that in-kernel we > wouldn't have one extra task per mm being checkpointed. Thawed checkpointing can be done without any COW tax, by leveraging the native hardware dirty bit in page tables. There is no need to trigger additional faults. Tracking modified pages using the dirty bit is a feature also desired by the KVM community, and we plan to work with them on implementing it. Oren. ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-06 4:06 ` Oren Laadan @ 2010-11-06 5:18 ` Matt Helsley 0 siblings, 0 replies; 111+ messages in thread From: Matt Helsley @ 2010-11-06 5:18 UTC (permalink / raw) To: Oren Laadan Cc: Matt Helsley, Gene Cooperman, Luck, Tony, Kapil Arya, ksummit-2010-discuss@lists.linux-foundation.org, linux-kernel@vger.kernel.org On Sat, Nov 06, 2010 at 12:06:09AM -0400, Oren Laadan wrote: > On 11/05/2010 09:16 PM, Matt Helsley wrote: > > On Fri, Nov 05, 2010 at 01:17:03PM -0400, Gene Cooperman wrote: > >> On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote: > >>>> Oren noted that sometimes it's important to stop the process only > >>>> for a few milliseconds while one checkpoints. In DMTCP, we do that > >>>> by configuring with --enable-forked-checkpointing. This causes us > >>>> to fork a child process taking advantage of copy-on-write and then > >>>> checkpoint the memory pages of the child while the parent continues > >>>> to execute. > >>> > >>> Interesting ... but while the process is only stopped for the duration > >>> of the fork, it may be taking COW faults on almost every page it > >>> touches. I think this will not work well for large HPC applications > >>> that allocate most of physical memory as anonymous pages for the > >>> application. It may even result in an OOM kill if you don't complete > >>> the checkpoint of the child and have it exit in a timely manner. <snip> > > The current linux-cr approach to handling [dirty] pages doesn't use COW. > > The tasks are frozen using the cgroup freezer and thus unable to modify > > the pages. So we don't have to mess with page tables nor do we pay > > any extra overhead for page faults. > > The current linux-cr patchset leaves out any optimizations > for simplicity of reviewing - first get it working and reviewed. > We experienced with optimizations with previous systems. 
> > > If we ever implement thawed checkpointing -- checkpointing while > > the task isn't frozen -- then we'd probably use COW and see > > the same faults. The difference then would be that in-kernel we > > wouldn't have one extra task per mm being checkpointed. > > Thawed checkpointing can be done with any COW tax, by leveraging > the native hardware dirty bit in page tables. There is no need to > trigger additional checkpoints. Tracking modified pages using the s/checkpoints/faults/ Cheers, -Matt Helsley ^ permalink raw reply [flat|nested] 111+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-05 17:17 ` Gene Cooperman 2010-11-06 1:16 ` Matt Helsley @ 2010-11-06 21:00 ` Oren Laadan 1 sibling, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-06 21:00 UTC (permalink / raw) To: Gene Cooperman Cc: Luck, Tony, Kapil Arya, ksummit-2010-discuss@lists.linux-foundation.org, linux-kernel@vger.kernel.org On 11/05/2010 01:17 PM, Gene Cooperman wrote: > On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote: >>> Oren noted that sometimes it's important to stop the process only >>> for a few milliseconds while one checkpoints. In DMTCP, we do that >>> by configuring with --enable-forked-checkpointing. This causes us >>> to fork a child process taking advantage of copy-on-write and then >>> checkpoint the memory pages of the child while the parent continues >>> to execute. >> >> Interesting ... but while the process is only stopped for the duration >> of the fork, it may be taking COW faults on almost every page it >> touches. I think this will not work well for large HPC applications >> that allocate most of physical memory as anonymous pages for the >> application. It may even result in an OOM kill if you don't complete >> the checkpoint of the child and have it exit in a timely manner. >> >> -Tony >> > > I agree with you that forked checkpointing is probably not what you > want in the middle of an HPC computation. But isn't that part of > the nature of COW? Whether the COW is invoked within the kernel, > or from outside the kernel via fork --- in either case, when you have > mostly dirty pages, you will have to copy most of the pages. > Do I understand your point correctly? Thanks, > - Gene COW is one way of reducing down time (whether through fork or in-kernel checkpoint). However, it is possible to avoid using it (and thus avoid extra page faults and memory overload) by using the page-table "dirty" bit to track dirty pages. 
This way one can "pre-copy" the checkpoint image while the application is running, without additional overhead (the idea is similar to how live-migration is done). Oren.
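The pre-copy idea described in the message above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the linux-cr implementation: memory is modelled as a Python list of page values, and the set returned by the workload stands in for the page-table dirty bits that the hardware would set. The names `precopy_checkpoint` and `shrinking_workload`, the `max_rounds` cutoff, and the `stop_threshold` are all hypothetical.

```python
def precopy_checkpoint(pages, workload, max_rounds=5, stop_threshold=2):
    """Copy `pages` into an image while `workload` keeps writing to them.

    `workload(pages, round_no)` mutates `pages` in place and returns the set
    of page indices it wrote; that set simulates the hardware dirty bits.
    Returns (image, number_of_pages_copied_while_stopped).
    """
    image = [None] * len(pages)
    dirty = set(range(len(pages)))  # first round: every page needs copying
    for round_no in range(max_rounds):
        if len(dirty) <= stop_threshold:
            break  # few enough dirty pages left: cheap to stop and finish
        for i in dirty:
            image[i] = pages[i]  # copy while the application keeps running
        # Pages written during this round must be copied again later.
        dirty = set(workload(pages, round_no))
    # Final stop-and-copy phase: the application is paused only for this part.
    copied_while_stopped = len(dirty)
    for i in dirty:
        image[i] = pages[i]
    return image, copied_while_stopped


def shrinking_workload(pages, round_no):
    """Hypothetical application whose set of written pages shrinks each round."""
    hot = range(max(1, 8 >> round_no))
    for i in hot:
        pages[i] += 1
    return set(hot)


if __name__ == "__main__":
    memory = list(range(16))
    image, stopped = precopy_checkpoint(memory, shrinking_workload)
    print("image consistent:", image == memory)
    print("pages copied while stopped:", stopped)
```

The point of the design is that the application is paused only for the last few dirty pages, and no copy-on-write faults or extra copies of memory are needed along the way.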
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-05 3:55 ` Kapil Arya 2010-11-05 11:57 ` Luck, Tony @ 2010-11-05 17:31 ` Sukadev Bhattiprolu 2010-11-06 21:05 ` Oren Laadan 2 siblings, 0 replies; 111+ messages in thread From: Sukadev Bhattiprolu @ 2010-11-05 17:31 UTC (permalink / raw) To: Kapil Arya Cc: Oren Laadan, Tejun Heo, ksummit-2010-discuss, linux-kernel, Gene Cooperman On Thu, Nov 4, 2010 at 8:55 PM, Kapil Arya <kapil@ccs.neu.edu> wrote: >> * Complexity: they technically implement a virtual pid-namespace in userspace >> by intercepting calls to clone(). I wonder if they consider e.g. pid's saved >> on file owners or in afunix creds ? I'll just say it's nearly impossible with >> their 20K lines of code - I know because I did it in a kernel module ... > > We do wrap clone and create a table from original PID/TID to current PID/TID > just as you say. To our knowledge, we have wrappers for all system calls > involving a PID/TID except fcntl. We are guessing that either Linux C/R also > keeps a translation table or else restores the original PID/TID. Which do you > do? In the latter case what do you do if a PID/TID is already used by another > process/thread? > Like Oren said, we run the application inside the container - which would have its own pid namespace. When we restart, we again create a container, which starts with a fresh pid namespace, so the pids will not be in use. IOW, a process has a virtual pid and a global pid. The virtual pid is what the application sees when it calls getpid() and that pid will be correctly restored when you create the container. Sukadev
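The userspace alternative that Kapil describes in the quoted text, a table from original PID/TID to current PID/TID, can be sketched as follows. This is a hypothetical illustration, not DMTCP code: the class name `PidTable` and its methods are invented here, and real system call wrappers would consult such a table on every PID argument and return value. With a fresh pid namespace, as in Sukadev's reply, no such table is needed because the original pids are free to be reused.

```python
class PidTable:
    """Two-way map between the pid an application saw before checkpoint
    (original) and the pid the kernel assigned after restart (current)."""

    def __init__(self):
        self._orig_to_cur = {}
        self._cur_to_orig = {}

    def bind(self, original, current):
        # Refuse a binding that would make one real pid stand for two
        # different original pids.
        owner = self._cur_to_orig.get(current)
        if owner is not None and owner != original:
            raise ValueError(f"pid {current} already bound to {owner}")
        # Drop the stale binding left over from before the restart, if any.
        old = self._orig_to_cur.pop(original, None)
        if old is not None:
            del self._cur_to_orig[old]
        self._orig_to_cur[original] = current
        self._cur_to_orig[current] = original

    def to_current(self, original):
        """Translate a pid passed in by the application (e.g. to kill())."""
        return self._orig_to_cur[original]

    def to_original(self, current):
        """Translate a pid returned by the kernel (e.g. from getpid())."""
        return self._cur_to_orig[current]


if __name__ == "__main__":
    table = PidTable()
    table.bind(1234, 1234)   # before checkpoint: identity mapping
    table.bind(1234, 5678)   # after restart the kernel handed out a new pid
    print(table.to_original(5678), table.to_current(1234))
```

The cost of this approach is that every interface carrying a pid has to be wrapped, which is the coverage concern raised in the quoted "Complexity" remark.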
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-05 3:55 ` Kapil Arya 2010-11-05 11:57 ` Luck, Tony 2010-11-05 17:31 ` Sukadev Bhattiprolu @ 2010-11-06 21:05 ` Oren Laadan 2 siblings, 0 replies; 111+ messages in thread From: Oren Laadan @ 2010-11-06 21:05 UTC (permalink / raw) To: Kapil Arya; +Cc: Tejun Heo, ksummit-2010-discuss, linux-kernel, Gene Cooperman On 11/04/2010 11:55 PM, Kapil Arya wrote: > (Sorry for the length of this email, we are excited about being able > to discuss technical details.) > > This is wonderful to have this exchange of techniques and visions. Oren, we > are guessing that you are at Columbia. If so, we would love to have you come up > here and give a talk in Boston. Alternatively, if you prefer, we would be happy > to go to Columbia and give a talk there. With pleasure. (LPC would have been a good opportunity - I was in Boston). Oren.
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [not found] <Pine.LNX.4.64.1011021530470.12128@takamine.ncl.cs.columbia.edu> 2010-11-02 21:35 ` [Ksummit-2010-discuss] checkpoint-restart: naked patch Tejun Heo @ 2010-11-08 16:55 ` Grant Likely 2010-11-08 21:01 ` Nathan Lynch ` (2 more replies) 1 sibling, 3 replies; 111+ messages in thread From: Grant Likely @ 2010-11-08 16:55 UTC (permalink / raw) To: Oren Laadan Cc: ksummit-2010-discuss, Linux Kernel Mailing List, Christoph Hellwig On Tue, Nov 2, 2010 at 3:30 PM, Oren Laadan <orenl@cs.columbia.edu> wrote: > Hi, > > Following the discussion yesterday, here is a linux-cr diff that > that is limited to changes to existing code. > > The diff doesn't include the eclone() patches. I also tried to strip > off the new c/r code (either code in new files, or new code within > #ifdef CONFIG_CHECKPOINT in existing files). > > I left a few such snippets in, e.g. c/r syscalls templates and > declaration of c/r specific methods in, e.g. file_operations. > > The remaining changes in this patch include new freezer state > ("CHECKPOINTING"), mostly refactoring of exsiting code, and a bit > of new helpers. > > Disclaimer: don't try to compile (or apply) - this is only intended > to give a ballpark of how the c/r patches change existing code. [...] > 159 files changed, 2031 insertions(+), 587 deletions(-) FWIW... This patch has far reaching changes which quite frankly scare me; primarily because c/r changes many long-held assumptions about how Linux processes work. It needs to track a large amount of state with lots of corner cases, and the Linux process model is already quite complex. I know this is a fluffy hand-waving critique, but without being convinced of a strong general-purpose use-case, it is hard to get excited about a solution that touches large amounts of common code. 
c/r of desktop processes doesn't seem interesting other than as a test case. I can possibly be convinced about HPC, embedded, industrial, or telecom use-cases, but for custom/specific-purpose applications the question must be asked whether a fully user space or joint user/kernel method would better solve the problem. You mentioned in a reply that this overview diff includes both cleanups and required changes. I suggest posting the cleanup patches as soon as possible so that this diff becomes simpler. Also: > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 9458685..335a4b3 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -93,6 +93,10 @@ config STACKTRACE_SUPPORT > config HAVE_LATENCYTOP_SUPPORT > def_bool y > > +config CHECKPOINT_SUPPORT > + bool > + default y > + Definitely should not default to 'y', and needs to be user-selectable. g.
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-08 16:55 ` Grant Likely @ 2010-11-08 21:01 ` Nathan Lynch 2010-11-11 6:27 ` Nathan Lynch 2010-11-17 5:29 ` Anton Blanchard 2 siblings, 0 replies; 111+ messages in thread From: Nathan Lynch @ 2010-11-08 21:01 UTC (permalink / raw) To: Grant Likely Cc: Oren Laadan, ksummit-2010-discuss, Linux Kernel Mailing List, Christoph Hellwig Hi Grant, On Mon, 2010-11-08 at 11:55 -0500, Grant Likely wrote: > Also: > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > > index 9458685..335a4b3 100644 > > --- a/arch/x86/Kconfig > > +++ b/arch/x86/Kconfig > > @@ -93,6 +93,10 @@ config STACKTRACE_SUPPORT > > config HAVE_LATENCYTOP_SUPPORT > > def_bool y > > > > +config CHECKPOINT_SUPPORT > > + bool > > + default y > > + > > Definitely should not default to 'y', and needs to be user-selectable. CHECKPOINT_SUPPORT is what an arch sets to indicate that it has support for C/R -- the user selectable option is in a generic location (and defaults to n).
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-08 16:55 ` Grant Likely 2010-11-08 21:01 ` Nathan Lynch @ 2010-11-11 6:27 ` Nathan Lynch 2010-11-17 5:29 ` Anton Blanchard 2 siblings, 0 replies; 111+ messages in thread From: Nathan Lynch @ 2010-11-11 6:27 UTC (permalink / raw) To: Grant Likely Cc: Oren Laadan, ksummit-2010-discuss, Linux Kernel Mailing List, Christoph Hellwig On Mon, 2010-11-08 at 11:55 -0500, Grant Likely wrote: > On Tue, Nov 2, 2010 at 3:30 PM, Oren Laadan <orenl@cs.columbia.edu> wrote: > > Hi, > > > > Following the discussion yesterday, here is a linux-cr diff that > > that is limited to changes to existing code. > > > > The diff doesn't include the eclone() patches. I also tried to strip > > off the new c/r code (either code in new files, or new code within > > #ifdef CONFIG_CHECKPOINT in existing files). > > > > I left a few such snippets in, e.g. c/r syscalls templates and > > declaration of c/r specific methods in, e.g. file_operations. > > > > The remaining changes in this patch include new freezer state > > ("CHECKPOINTING"), mostly refactoring of exsiting code, and a bit > > of new helpers. > > > > Disclaimer: don't try to compile (or apply) - this is only intended > > to give a ballpark of how the c/r patches change existing code. > [...] > > 159 files changed, 2031 insertions(+), 587 deletions(-) > > FWIW... > > This patch has far reaching changes which quite frankly scare me; > primarily because c/r changes many long-held assumptions about how > Linux processes work. It needs to track a large amount of state with > lots of corner cases, and the Linux process model is already quite > complex. I know this is a fluffy hand-waving critique, but without > being convinced of a strong general-purpose use-case, it is hard to > get excited about a solution that touches large amounts of common > code. 
For the most part the c/r patch set is "merely" adding code and not changing the way existing code works -- I'm pretty sure we haven't had to alter anything hairy like locking or object lifetime rules. Maybe I've had my head in this code for too long, but I'm not seeing how assumptions about the process model are changed significantly. All the process-related APIs like fork, clone, exec, wait, and exit all work as they have before and if you're not actively using C/R you'd never know the capability is there. As for the lack of a general-purpose use-case... well, it's not terribly unusual for Linux to sustain significant changes to satisfy what some may consider a niche need. Things like NUMA support, CPU and memory hotplug - these were not "generally" useful features when they were introduced. So I don't think we're trying to break new ground in that respect. > c/r of desktop processes doesn't seem interesting other that as a test > case, but I can possibly be convinced about HPC, embedded, industrial, > or telecom use-cases, but for custom/specific-purpose applications the > question must be asked if a fully user space or joint user/kernel > method would better solve the problem. This is in fact a joint approach -- the process tree is recreated in user space at restart (not to mention that the user is responsible for providing the restarted job a coherent view of the filesystem). In any case, with HPC, C/R isn't about just fault tolerance necessarily; it's for load-balancing and migration too. So the checkpoint operation needs to be as fast and efficient as possible, and ideally the image should be readable/writable as a stream e.g. over a socket. User space really isn't up to this - for example, a user space implementation generally cannot know which user pages are safe to omit from the image (at least not without faulting them all in). 
Users who need C/R on Linux today are resorting to LD_PRELOAD hacks and moribund out-of-tree kernel patches, and I'm afraid they're going to keep doing that until Linux provides a better alternative built-in.
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-08 16:55 ` Grant Likely 2010-11-08 21:01 ` Nathan Lynch 2010-11-11 6:27 ` Nathan Lynch @ 2010-11-17 5:29 ` Anton Blanchard 2010-11-17 11:08 ` Tejun Heo ` (3 more replies) 2 siblings, 4 replies; 111+ messages in thread From: Anton Blanchard @ 2010-11-17 5:29 UTC (permalink / raw) To: Grant Likely Cc: Oren Laadan, ksummit-2010-discuss, Linux Kernel Mailing List, Christoph Hellwig, akpm, tj Hi Grant, > This patch has far reaching changes which quite frankly scare me; > primarily because c/r changes many long-held assumptions about how > Linux processes work. It needs to track a large amount of state with > lots of corner cases, and the Linux process model is already quite > complex. I know this is a fluffy hand-waving critique, but without > being convinced of a strong general-purpose use-case, it is hard to > get excited about a solution that touches large amounts of common > code. > > c/r of desktop processes doesn't seem interesting other that as a test > case, but I can possibly be convinced about HPC, embedded, industrial, > or telecom use-cases, but for custom/specific-purpose applications the > question must be asked if a fully user space or joint user/kernel > method would better solve the problem. It seems like there are a number of questions around the utility of C/R so I'd like to take a step back from the technical discussion around implementation and hopefully convince you, Tejun (and anyone else interested) that C/R is something we want to solve in Linux. Here at IBM we are working on the next generation of HPC systems. One example of this will be the NCSA Bluewaters supercomputer: http://www.ncsa.illinois.edu/BlueWaters/ The aim is not to build yet another linpack special, but a supercomputer that achieves more than 1 petaflop sustained on a wide range of applications. There is also a strong focus on improving the productivity and reliability of the cluster. 
There are two usage scenarios for C/R in this environment: 1. Resource management. Any large HPC cluster should be 100% busy and as such you will often fill in the gaps with low priority jobs which may need to be preempted. These low priority jobs need to give up their resources (memory, interconnect resources etc) whenever something important comes in. 2. Fault tolerance. Failures are a fact of life for any decent sized cluster. As the cluster gets larger these failures become very common. Speaking from an industry perspective, MTBF rates measured in the order of several hours for large commodity clusters are not surprising. We at IBM improve on that with hardware and system design, but there is only so much you can do. The failures also happen at the Linux kernel level so even if we had 100% reliable systems we would still have issues. Now this is the pointy end of HPC, but similar issues are happening in the meat of the HPC market. One area we are seeing a lot of C/R interest is the EDA space. As ICs become more and more complex the amount of cluster compute power it takes to route, check, create masks etc grows so large that system reliability becomes an issue. Some tool vendors write their own application C/R, but there are a multitude of in house applications that have no C/R capability today. You could argue that we should just add C/R capability to every HPC application and library people care about or rework them to be fault tolerant in software. Unfortunately I don't see either as being viable. There are so many applications, libraries and even programming languages in use for HPC that it would be a losing battle. If we did go down this route we would also be unable to leverage C/R for anything else. I can understand the concern around finding a general purpose case, but I do believe many other solid uses for C/R outside of HPC will emerge. 
For example, there was interest from the embedded guys during the KS discussion and I can easily imagine using C/R to bring up firefox faster on a TV. The problems found in HPC often turn into more general problems down the track. I think back to the heated discussions we had around SMP back in the early 2000s when we had 32 core POWER4s and SGI had similar sized machines. Now a 24 core machine fits in 1U and can be purchased for under $5k. NUMA support, CPU affinity and multi queue scheduling are other areas that initially had a very small user base but have since become important features for many users. Anton
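The fault-tolerance point in the message above (cluster MTBF on the order of hours) follows from simple arithmetic. The sketch below assumes independent node failures at a constant rate, in which case failure rates add and the cluster MTBF is the per-node MTBF divided by the node count. The function name `cluster_mtbf_hours` and the figures used (a 5-year node MTBF, 10,000 nodes) are hypothetical illustrations, not numbers from the thread.

```python
def cluster_mtbf_hours(node_mtbf_hours, nodes):
    """Mean time between failures for the whole cluster, assuming each node
    fails independently at a constant rate of 1 / node_mtbf_hours."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return node_mtbf_hours / nodes


if __name__ == "__main__":
    hours_per_year = 8760
    node_mtbf = 5 * hours_per_year          # hypothetical: one failure per node every 5 years
    print(cluster_mtbf_hours(node_mtbf, 10_000))   # hypothetical 10,000-node cluster
```

Under these assumptions even very reliable nodes give a cluster-wide failure every few hours, which is why long-running jobs need some way to save and resume their state.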
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 5:29 ` Anton Blanchard @ 2010-11-17 11:08 ` Tejun Heo 2010-11-18 9:53 ` Alan Cox ` (2 subsequent siblings) 3 siblings, 0 replies; 111+ messages in thread From: Tejun Heo @ 2010-11-17 11:08 UTC (permalink / raw) To: Anton Blanchard Cc: Grant Likely, Oren Laadan, ksummit-2010-discuss, Linux Kernel Mailing List, Christoph Hellwig, akpm Hello, On 11/17/2010 06:29 AM, Anton Blanchard wrote: > It seems like there are a number of questions around the utility of > C/R so I'd like to take a step back from the technical discussion > around implementation and hopefully convince you, Tejun (and anyone > else interested) that C/R is something we want to solve in Linux. I'm not arguing CR isn't that useful. My argument was that it's a solution for a fairly niche problem and that the implementation isn't transparent at all for general use cases. > Here at IBM we are working on the next generation of HPC systems. One > example of this will be the NCSA Bluewaters supercomputer: And, yeah, I agree that it is a very useful thing for HPC. > You could argue that we should just add C/R capability to every HPC > application and library people care about or rework them to be > fault tolerant in software. Unfortunately I don't see either as being > viable. There are so many applications, libraries and even programming > languages in use for HPC that it would be a losing battle. If we > did go down this route we would also be unable to leverage C/R for > anything else. I can understand the concern around finding a general > purpose case, but I do believe many other solid uses for C/R outside of > HPC will emerge. For example, there was interest from the embedded guys > during the KS discussion and I can easily imagine using C/R to bring up > firefox faster on a TV. Thanks for pointing out the use cases although for the last one it would be much wiser to just use webkit.
> The problems found in HPC often turn into more general problems down > the track. I think back to the heated discussions we had around SMP back > in the early 2000s when we had 32 core POWER4s and SGI had similar sized > machines. Now a 24 core machine fits in 1U and can be purchased for > under $5k. NUMA support, CPU affinity and multi queue scheduling are > other areas that initially had a very small user base but have since > become important features for many users. Sure, the pointy edges can discover the general problems of the future early. At the same time, they also encounter problems which no one else would care about ever, so in itself it isn't much of an argument. I'm no analyst and it is very difficult to foretell the future but comparing CR to SMP and NUMA doesn't seem too valid to me. In-kernel CR is sandwiched between userland CR and virtualization. Its problem space is shrinking, not expanding. Having a generally accepted standard CR implementation would certainly be very nice and I understand that CR would be a much better fit for HPC than virtualization, but I fail to see why it should be implemented in kernel when a userland implementation, which doesn't extend the kernel in any way, already achieves most of what HPC workloads require. In this already sizeable thread, the only benefits presented seem to be the ability to cover some more corner cases and remote use cases in a slightly more transparent manner. Those are very weak arguments for something as intrusive and backwards (in that it dumps kernel states in binary blobs unrestrained by ABI) as in-kernel CR and, as such, I don't really see the in-kernel CR surviving as a mainline feature. So, I think the best recourse would be identifying the specific features which would help userland CR and improve them. The in-kernel CR people have been working on the problem for a long time now and gotta know which parts are tricky and how to solve them. In fact, I don't think the work would be that widely different.
It would be harder but those changes would benefit other use cases too instead of being useful only for in-kernel CR. Thanks. -- tejun
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 5:29 ` Anton Blanchard 2010-11-17 11:08 ` Tejun Heo @ 2010-11-18 9:53 ` Alan Cox 2010-11-18 12:27 ` Alexey Dobriyan 2010-11-19 6:33 ` Gene Cooperman 2010-11-21 23:20 ` Grant Likely 3 siblings, 1 reply; 111+ messages in thread From: Alan Cox @ 2010-11-18 9:53 UTC (permalink / raw) To: Anton Blanchard Cc: Grant Likely, Linux Kernel Mailing List, ksummit-2010-discuss > The problems found in HPC often turn into more general problems down > the track. I think back to the heated discussions we had around SMP back > in the early 2000s when we had 32 core POWER4s and SGI had similar sized > machines. Now a 24 core machine fits in 1U and can be purchased for > under $5k. NUMA support, CPU affinity and multi queue scheduling are > other areas that initially had a very small user base but have since > become important features for many users. "I'd prefer the trees to be separate for testing purposes: it doesn't make much sense to have SMP support as a normal kernel feature when most people won't have SMP anyway" -- Linus Torvalds Only in this case I can't help feeling that the virtualisation work already bypassed C/R, solved the problem space that a lot of people care about and then moved on. Alan
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-18 9:53 ` Alan Cox @ 2010-11-18 12:27 ` Alexey Dobriyan 0 siblings, 0 replies; 111+ messages in thread From: Alexey Dobriyan @ 2010-11-18 12:27 UTC (permalink / raw) To: Alan Cox Cc: Anton Blanchard, Grant Likely, Linux Kernel Mailing List, ksummit-2010-discuss On Thu, Nov 18, 2010 at 11:53 AM, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote: > Only in this case I can't help feeling that the virtualisation work > already bypassed C/R, This discussion should have happened like 10 years ago. :-\ > solved the problem space that a lot of people care about and then moved on.
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 5:29 ` Anton Blanchard 2010-11-17 11:08 ` Tejun Heo 2010-11-18 9:53 ` Alan Cox @ 2010-11-19 6:33 ` Gene Cooperman 2010-11-21 23:20 ` Grant Likely 3 siblings, 0 replies; 111+ messages in thread From: Gene Cooperman @ 2010-11-19 6:33 UTC (permalink / raw) To: Anton Blanchard Cc: Grant Likely, Oren Laadan, ksummit-2010-discuss, Linux Kernel Mailing List, Christoph Hellwig, akpm, tj, Kapil Arya, Gene Cooperman > 1. Resource management. Any large HPC cluster should be 100% busy and > as such you will often fill in the gaps with low priority jobs which > may need to be preempted. These low priority jobs need to give up their > resources (memory, interconnect resources etc) whenever something > important comes in. > > 2. Fault tolerance. Failures are a fact of life for any decent sized > cluster. As the cluster gets larger these failures become very common. > Speaking from an industry perspective, MTBF rates measured in the order > of several hours for large commodity clusters are not surprising. We at > IBM improve on that with hardware and system design, but there is only > so much you can do. The failures also happen at the Linux kernel level > so even if we had 100% reliable systems we would still have issues. We have also been somewhat involved in HPC. Anton provides a nice summary of the two usage scenarios of checkpoint-restart that we have also observed. Since there is continuing discussion of HPC, I was a little surprised that there has not been more discussion of BLCR (Berkeley Lab Checkpoint/Restart). A brief introduction to BLCR follows, in case it's of interest. https://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml In the HPC space, we have observed that many sites use BLCR for checkpoint-restart. BLCR is based on a kernel module, and so represents a third approach.
As mentioned on the FAQ, BLCR can checkpoint/restart a process tree/group/session but has certain limitations, such as not supporting sockets, ptys, and restoring original pids on restart only if there is no collision with current pids. Nevertheless, BLCR has achieved wide usage in the HPC community. Quoting from the BLCR FAQ: Q: Does BLCR support checkpointing parallel/distributed applications? Not by itself. But by using checkpoint callbacks (see previous FAQ), some MPI implementations have made themselves checkpointable by BLCR. You can checkpoint/restart an MPI application running across an entire cluster of machines with BLCR, without any application code modifications, if you use one of these MPI implementations (listed alphabetically): * LAM/MPI 7.x or later * MPICH-V 1.0.x * MVAPICH2 0.9.8 or later * Open MPI 1.3 or later Q: Is BLCR integrated with any batch systems? We are aware of the following, but we are not always informed of new efforts to integrate with BLCR. For the most up-to-date information you should consult the support channels of your favorite batch system. * TORQUE version 2.3 and later Support for serial and parallel jobs, including periodic checkpoints and qhold/qrls. * SLURM version 2.0 and later Support for automatic (periodic) and manually requested checkpoints. * SGE (aka Sun Grid Engine) Information on configuring SGE to use BLCR can be found here. There is also a thread on the checkpoint@lbl.gov list about modifications to those instructions. The thread begins with this posting. * LSF Information on configuring LSF to use BLCR can be found in this posting on the checkpoint@lbl.gov list. * Condor Information on configuring Condor to use BLCR to checkpoint "Vanilla Universe" jobs with the help of Parrot can be found here. - Gene
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 5:29 ` Anton Blanchard ` (2 preceding siblings ...) 2010-11-19 6:33 ` Gene Cooperman @ 2010-11-21 23:20 ` Grant Likely 3 siblings, 0 replies; 111+ messages in thread From: Grant Likely @ 2010-11-21 23:20 UTC (permalink / raw) To: Anton Blanchard Cc: Oren Laadan, ksummit-2010-discuss, Linux Kernel Mailing List, Christoph Hellwig, akpm, tj On Tue, Nov 16, 2010 at 10:29 PM, Anton Blanchard <anton@au1.ibm.com> wrote: > Hi Grant, [...] > There are two usage scenarios for C/R in this environment: > > 1. Resource management. Any large HPC cluster should be 100% busy and > as such you will often fill in the gaps with low priority jobs which > may need to be preempted. These low priority jobs need to give up their > resources (memory, interconnect resources etc) whenever something > important comes in. > > 2. Fault tolerance. Failures are a fact of life for any decent sized > cluster. As the cluster gets larger these failures become very common. > Speaking from an industry perspective, MTBF rates measured in the order > of several hours for large commodity clusters are not surprising. We at > IBM improve on that with hardware and system design, but there is only > so much you can do. The failures also happen at the Linux kernel level > so even if we had 100% reliable systems we would still have issues. > > Now this is the pointy end of HPC, but similar issues are happening in > the meat of the HPC market. One area we are seeing a lot of C/R > interest is the EDA space. As ICs become more and more complex the > amount of cluster compute power it takes to route, check, create masks > etc grows so large that system reliability becomes an issue. Some tool > vendors write their own application C/R, but there are a multitude of > in house applications that have no C/R capability today. I agree, and I think this is exactly the place where the discussions about c/r need to be focused (the pointy end). 
I don't tend to swoon at the idea of c/r'ing my desktop session because it doesn't represent a real or interesting problem for me. However, I do see the value in the scenarios described above. I have another for you; I peripherally worked on a telephone switch system that used a form of C/R for the call processing task to synchronise with a hot-standby node for uninterrupted cut-over in the event of failure. /my/ concerns are more of the, "what is the impact on the kernel?" type. > You could argue that we should just add C/R capability to every HPC > application and library people care about or rework them to be > fault tolerant in software. Unfortunately I don't see either as being > viable. There are so many applications, libraries and even programming > languages in use for HPC that it would be a losing battle. If we > did go down this route we would also be unable to leverage C/R for > anything else. Fair enough, and I do somewhat agree with this. However the question remains, what are the constraints? What are the limitations and boundaries? Oren describes the constraints on the current c/r patches. How well do those match up with the use cases discussed above? How does DMTCP match up with those use cases? > I can understand the concern around finding a general > purpose case, but I do believe many other solid uses for C/R outside of > HPC will emerge. For example, there was interest from the embedded guys > during the KS discussion and I can easily imagine using C/R to bring up > firefox faster on a TV. Heh, sounds like doing the initial-program-load (IPL) stage like I used to do on telephone switch firmware. :-) > > The problems found in HPC often turn into more general problems down > the track. I think back to the heated discussions we had around SMP back > in the early 2000s when we had 32 core POWER4s and SGI had similar sized > machines. Now a 24 core machine fits in 1U and can be purchased for > under $5k.
NUMA support, CPU affinity and multi queue scheduling are > other areas that initially had a very small user base but have since > become important features for many users. > > Anton > -- Grant Likely, B.Sc., P.Eng. Secret Lab Technologies Ltd.