* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [not found] ` <20101106204008.GA31077@sundance.ccs.neu.edu> @ 2010-11-07 21:44 ` Oren Laadan 2010-11-07 23:31 ` Gene Cooperman [not found] ` <4CD5D99A.8000402@cs.columbia.edu> 1 sibling, 1 reply; 49+ messages in thread From: Oren Laadan @ 2010-11-07 21:44 UTC (permalink / raw) To: Gene Cooperman Cc: Matt Helsley, Tejun Heo, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch, Linux Containers [cc'ing linux containers mailing list] On 11/06/2010 04:40 PM, Gene Cooperman wrote: > 8. What happens if the DMTCP coordinator ( checkpoint control process) dies? > [ The same thing that happens if a user process dies. We kill the whole > computation, and restart. At restart, we use a new coordinator. > Coordinators are stateless. ] My experience is different: I downloaded dmtcp and followed the quick-start guide: (1) "dmtcp_coordinator" on one terminal (2) "dmtcp_checkpoint bash" on another terminal Then I: (3) pkill -9 dmtcp_coordinator ... oops - 'bash' died. I didn't even try to take a checkpoint :( Oren. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-07 21:44 ` [Ksummit-2010-discuss] checkpoint-restart: naked patch Oren Laadan @ 2010-11-07 23:31 ` Gene Cooperman 0 siblings, 0 replies; 49+ messages in thread From: Gene Cooperman @ 2010-11-07 23:31 UTC (permalink / raw) To: Oren Laadan Cc: Gene Cooperman, Matt Helsley, Tejun Heo, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch, Linux Containers On Sun, Nov 07, 2010 at 04:44:20PM -0500, Oren Laadan wrote: > [cc'ing linux containers mailing list] > > On 11/06/2010 04:40 PM, Gene Cooperman wrote: > > >8. What happens if the DMTCP coordinator ( checkpoint control process) dies? > > [ The same thing that happens if a user process dies. We kill the whole > > computation, and restart. At restart, we use a new coordinator. > > Coordinators are stateless. ] > > My experience is different: > > I downloaded dmtcp and followed the quick-start guide: > (1) "dmtcp_coordinator" on one terminal > (2) "dmtcp_checkpoint bash" on another terminal > > Then I: > (3) pkill -9 dmtcp_coordinator > ... oops - 'bash' died. > > I didn't even try to take a checkpoint :( You're right. I just reproduced your example. But please remember that we're working in a design space where if any process of a computation dies, then we kill the computation and restart. It doesn't matter to us if it's a user process or the DMTCP coordinator that died. I do think this is getting too detailed for the LKML list, but since you bring it up, here is the analysis. The user bash process exits with: [31331] ERROR at dmtcpmessagetypes.cpp:62 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed' _magicBits = Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly? This means that when the DMTCP coordinator died, it sent a message to the checkpoint thread within the user process. The message was ill-formed. 
The current DMTCP code says that if a checkpoint thread receives an ill-formed message from the coordinator, then it should die. It's not hard to change the protocol between DMTCP coordinator and checkpoint thread of the user process into a more robust protocol with RETRY, further ACK, etc. We haven't done this. Right now, the user simply restarts from the last checkpoint. If one process of a computation has been compromised (either DMTCP coordinator or user process), then the whole computation has been compromised. I think in a previous version of DMTCP, the policy was to allow the computation to continue when the coordinator dies. Policies change. But I think you're missing the larger point. We've developed DMTCP over six years, largely with programmers who are much less experienced than the kernel developers. Yet DMTCP works reliably for many users. I consider this a credit to the DMTCP design. The Linux C/R design is also excellent. Can we get back to questions of design, using the implementations as reference implementations? If you don't object, I'll also skip replying to the other post, since I think we're getting too detailed. I'm having trouble keeping up with the posts. :-) An offline discussion will give us time to look more carefully at these issues, and draw more careful conclusions. Thanks, - Gene ^ permalink raw reply [flat|nested] 49+ messages in thread
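Gene's analysis above comes down to a header check: the checkpoint thread compares the magic bytes in each coordinator message and exits on a mismatch. As a rough sketch of the "more robust protocol with RETRY" he mentions, and not DMTCP's actual code, the following assumes a hypothetical 16-byte magic string followed by a 4-byte message type. MAGIC, HEADER, parse_message and receive are all illustrative names.

```python
import struct

# Hypothetical wire format, NOT DMTCP's real one: a 16-byte magic string
# followed by a 4-byte big-endian message type.
MAGIC = b"DMTCP_MAGIC_DEMO"          # illustrative value only
HEADER = struct.Struct("!16sI")


class BadMessage(Exception):
    """Raised when a header is too short or carries the wrong magic bytes."""


def pack_message(msg_type: int) -> bytes:
    """Build a well-formed header for the given message type."""
    return HEADER.pack(MAGIC, msg_type)


def parse_message(raw: bytes) -> int:
    """Return the message type, or raise BadMessage for an ill-formed header."""
    if len(raw) < HEADER.size:
        raise BadMessage("short read")
    magic, msg_type = HEADER.unpack_from(raw)
    if magic != MAGIC:
        raise BadMessage("magic mismatch")
    return msg_type


def receive(read_once, max_attempts: int = 3):
    """Call read_once() until a valid header arrives or attempts run out.

    Returns the message type, or None so that the caller can pick a policy
    instead of exiting on the first ill-formed message.
    """
    for _ in range(max_attempts):
        try:
            return parse_message(read_once())
        except BadMessage:
            continue
    return None
```

A caller that gets None back can then decide whether to abort the whole computation, which is the policy described above, or to wait for a new coordinator.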
[parent not found: <4CD5D99A.8000402@cs.columbia.edu>]
[parent not found: <20101107184927.GF31077@sundance.ccs.neu.edu>]
[parent not found: <20101107184927.GF31077-Rl5vdzG4YPwx/1z6v04GWfZ8FUJU4vz8@public.gmane.org>]
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [not found] ` <20101107184927.GF31077-Rl5vdzG4YPwx/1z6v04GWfZ8FUJU4vz8@public.gmane.org> @ 2010-11-07 21:59 ` Oren Laadan 2010-11-17 11:57 ` Tejun Heo 0 siblings, 1 reply; 49+ messages in thread From: Oren Laadan @ 2010-11-07 21:59 UTC (permalink / raw) To: Gene Cooperman Cc: Kapil Arya, linux-kernel-u79uwXL29TY76Z2rM5mHXA, ksummit-2010-discuss-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Tejun Heo, Linux Containers, hch-jcswGhMUV9g [cc'ing linux containers mailing list] On 11/07/2010 01:49 PM, Gene Cooperman wrote: [snip] > Matt had asked how we would handle inotify(), but I was getting swamped > by all the questions. There is a virtualization approach to inotify in which > one puts wrappers around inotify_add_watch(), inotify_rm_watch() and > friends in the same way as we wrap open() and could wrap close(). > One would then need to wrap read() (which we don't like to do, just This sounds like reimplementation in userspace the very same logic done by the kernel :) > in case it could add significant overhead). But if we consider kernel > and userland virtualization together, then something similar to TIOCSTI > for ioctl would allow us to avoid wrapping read(). We could work to add ABIs and APIs for each and every possible piece of state that affects userspace. And for each we'll argue forever about the design and some time later regret that it wasn't designed correctly :p Even if that happens (which is very unlikely and unnecessary), it will generate all the very same code in the kernel that Tejun has been complaining about, and _more_. And we will still suffer from issues such as lack of atomicity and being unable to do many simple and advanced optimizations. 
Or we could use linux-cr for that: do the c/r in the kernel, keep the know-how in the kernel, expose (and commit to) a per-kernel-version ABI (rather than vowing to keep countless new individual ABIs forever after getting them wrong...), be able to do all sorts of useful optimizations, and provide atomicity and guarantees (see under "leak detection" in the OLS linux-cr paper). Also, once the c/r infrastructure is in the kernel, it will be easy (and encouraged) to support newly introduced features. Finally, we would then use dmtcp as well as other tools on top of the kernel-cr - and I'm looking forward to doing that! [snip] >> Hmm... can you really c/r from userspace a process that was, at >> checkpoint time, in a ptrace-stopped state at an arbitrary kernel >> ptrace-hook ? I strongly suspect the answer is "no", definitely >> not unless you also virtualize and replicate the entire in-kernel >> ptrace functionality in userspace, > > Let's try it and see. If you write a program, we'll try it out in > DMTCP (unstable branch) and see. So far, checkpointing gdb sessions > has worked well for us. If there is something we don't cover, it will > be helpful to both of us to find it, and analyze that case. Try "strace bash" :) I suspect it won't work - and for the reasons I described. [snip] >> (Now looking forward to discuss more details with dmtcp team on >> Tuesday and on :) > > Also a very good point above, and I agree. The offline discussion should > be a better forum for putting this all into perspective. > > Thanks again for your thoughtful response, Same here. Talk to you soon... Oren. ^ permalink raw reply [flat|nested] 49+ messages in thread
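The wrapper idea quoted above can be pictured as a small userspace table that shadows the kernel's watch list. The sketch below is only an illustration under stated assumptions: WatchTable and its methods are hypothetical names, not DMTCP code, and a real wrapper would also forward each call to the real inotify_add_watch() or inotify_rm_watch(). It does show the duplication Oren points at, since the table has to mirror kernel behaviour such as reusing the descriptor when the same object is watched twice.

```python
from dataclasses import dataclass, field


@dataclass
class WatchTable:
    """Userspace record of inotify watches, kept by hypothetical wrappers."""
    next_wd: int = 1
    watches: dict = field(default_factory=dict)   # virtual wd -> (path, mask)

    def add_watch(self, path: str, mask: int) -> int:
        # The kernel reuses the watch descriptor when an already-watched
        # object is added again (it matches by inode; simplified to the
        # path string here), so the shadow table has to do the same.
        for wd, (known_path, _) in self.watches.items():
            if known_path == path:
                self.watches[wd] = (path, mask)
                return wd
        wd = self.next_wd
        self.next_wd += 1
        self.watches[wd] = (path, mask)
        return wd

    def rm_watch(self, wd: int) -> None:
        # KeyError stands in for the EINVAL the kernel returns for a bad wd.
        self.watches.pop(wd)

    def replay(self, real_add_watch) -> dict:
        """At restart, recreate every watch; map virtual wd -> new real wd."""
        return {wd: real_add_watch(path, mask)
                for wd, (path, mask) in sorted(self.watches.items())}
```

Because the real descriptors obtained at restart can differ from the ones the application already holds, events coming back from read() would presumably need translating through the map that replay returns, which may be why read() has to be wrapped as well.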
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-07 21:59 ` Oren Laadan @ 2010-11-17 11:57 ` Tejun Heo 2010-11-17 15:39 ` Serge E. Hallyn 2010-11-17 22:17 ` Matt Helsley 0 siblings, 2 replies; 49+ messages in thread From: Tejun Heo @ 2010-11-17 11:57 UTC (permalink / raw) To: Oren Laadan Cc: Gene Cooperman, Matt Helsley, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch, Linux Containers Hello, Oren. On 11/07/2010 10:59 PM, Oren Laadan wrote: > We could work to add ABIs and APIs for each and every possible piece > of state that affects userspace. And for each we'll argue forever > about the design and some time later regret that it wasn't designed > correctly :p I'm sorry but in-kernel CR already looks like a major misdesign to me. > Even if that happens (which is very unlikely and unnecessary), > it will generate all the very same code in the kernel that Tejun > has been complaining about, and _more_. And we will still suffer > from issues such as lack of atomicity and being unable to do many > simple and advanced optimizations. It may be harder but those will be localized for specific features which would be useful for other purposes too. With in-kernel CR, you're adding a bunch of intrusive changes which can't be tested or used apart from CR. > Or we could use linux-cr for that: do the c/r in the kernel, > keep the know-how in the kernel, expose (and commit to) a > per-kernel-version ABI (not vow to keep countless new individual > ABIs forever after getting them wrongly...), be able to do all > sorts of useful optimization and provide atomicity and guarantees > (see under "leak detection" in the OLS linux-cr paper). Also, > once the c/r infrastructure is in the kernel, it will be easy > (and encouraged) to support newly introduced features.
And the only reason it seems easier is because you're working around the ABI problem by declaring that these binary blobs wouldn't be kept compatible between different kernel versions and configurations. That simply is the wrong approach. If you want to export something, build it properly into ABI. > Finally, then we would use dmtcp as well as other tools on top > of the kernel-cr - and I'm looking forward to do that ! Yeah, this part I agree. The higher level workarounds implemented in dmtcp are quite impressive and useful no matter what happens to lower layer. Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 11:57 ` Tejun Heo @ 2010-11-17 15:39 ` Serge E. Hallyn 2010-11-17 15:46 ` Tejun Heo 2010-11-17 22:17 ` Matt Helsley 1 sibling, 1 reply; 49+ messages in thread From: Serge E. Hallyn @ 2010-11-17 15:39 UTC (permalink / raw) To: Tejun Heo Cc: Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel, Matt Helsley, Linux Containers, Eric W. Biederman, xemul Quoting Tejun Heo (tj@kernel.org): > Hello, Oren. > > On 11/07/2010 10:59 PM, Oren Laadan wrote: > > We could work to add ABIs and APIs for each and every possible piece > > of state that affects userspace. And for each we'll argue forever > > about the design and some time later regret that it wasn't designed > > correctly :p > > I'm sorry but in-kernel CR already looks like a major misdesign to me. By this do you mean the very idea of having CR support in the kernel? Or our design of it in the kernel? Let's go back to July 2008, at the containers mini-summit, where it was unanimously agreed upon that the kernel was the right place (Checkpoint/Restart [CR] under http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that we would start by supporting a single task with no resources. Was that whole discussion effectively misguided, in your opinion? Or do you feel that since the first steps outlined in that discussion we've either "gone too far" or strayed in the subsequent design? -serge ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 15:39 ` Serge E. Hallyn @ 2010-11-17 15:46 ` Tejun Heo 2010-11-18 9:13 ` Pavel Emelyanov ` (2 more replies) 0 siblings, 3 replies; 49+ messages in thread From: Tejun Heo @ 2010-11-17 15:46 UTC (permalink / raw) To: Serge E. Hallyn Cc: Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel, Matt Helsley, Linux Containers, Eric W. Biederman, xemul Hello, Serge. On 11/17/2010 04:39 PM, Serge E. Hallyn wrote: >> I'm sorry but in-kernel CR already looks like a major misdesign to me. > > By this do you mean the very idea of having CR support in the kernel? > Or our design of it in the kernel? The former, I'm afraid. > Let's go back to July 2008, at the containers mini-summit, where it > was unanimously agreed upon that the kernel was the right place > (Checkpoint/Resetart [CR] under > http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that > we would start by supporting a single task with no resources. Was > that whole discussion effectively misguided, in your opinion? Or do > you feel that since the first steps outlined in that discussion > we've either "gone too far" or strayed in the subsequent design? The conclusion doesn't seem like such a good idea, well, at least to me for what it's worth. Conclusions at summits don't carry decisive weight. It'll still have to prove its worthiness for mainline all the same and in light of already working userland alternative and the expanded area now covered by virtualization, the arguments in this thread don't seem too strong. Thanks. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 15:46 ` Tejun Heo @ 2010-11-18 9:13 ` Pavel Emelyanov [not found] ` <4CE4EE21.6050305-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2010-11-18 19:53 ` Oren Laadan 2010-11-19 4:10 ` Serge Hallyn 2 siblings, 1 reply; 49+ messages in thread From: Pavel Emelyanov @ 2010-11-18 9:13 UTC (permalink / raw) To: Tejun Heo Cc: Serge E. Hallyn, Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Matt Helsley, Linux Containers, Eric W. Biederman On 11/17/2010 06:46 PM, Tejun Heo wrote: > Hello, Serge. > > On 11/17/2010 04:39 PM, Serge E. Hallyn wrote: >>> I'm sorry but in-kernel CR already looks like a major misdesign to me. >> >> By this do you mean the very idea of having CR support in the kernel? >> Or our design of it in the kernel? > > The former, I'm afraid. Can you elaborate on this please? >> Let's go back to July 2008, at the containers mini-summit, where it >> was unanimously agreed upon that the kernel was the right place >> (Checkpoint/Resetart [CR] under >> http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that >> we would start by supporting a single task with no resources. Was >> that whole discussion effectively misguided, in your opinion? Or do >> you feel that since the first steps outlined in that discussion >> we've either "gone too far" or strayed in the subsequent design? > > The conclusion doesn't seem like such a good idea, well, at least to > me for what it's worth. Conclusions at summits don't carry decisive > weight. It'll still have to prove its worthiness for mainline all the > same and in light of already working userland alternative and the > expanded area now covered by virtualization, the arguments in this > thread don't seem too strong. > > Thanks. > ^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <4CE4EE21.6050305-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [not found] ` <4CE4EE21.6050305-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2010-11-18 9:48 ` Tejun Heo 2010-11-18 20:13 ` Jose R. Santos 2010-11-19 3:54 ` Serge Hallyn 0 siblings, 2 replies; 49+ messages in thread From: Tejun Heo @ 2010-11-18 9:48 UTC (permalink / raw) To: Pavel Emelyanov Cc: Kapil Arya, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Eric W. Biederman, Linux Containers, Serge E. Hallyn

Hello, Pavel.

On 11/18/2010 10:13 AM, Pavel Emelyanov wrote:
>>> By this do you mean the very idea of having CR support in the kernel?
>>> Or our design of it in the kernel?
>>
>> The former, I'm afraid.
>
> Can you elaborate on this please?

I think I already did that several times in this thread but here's an attempt at summary.

* It adds a bunch of pseudo ABI when most of the same information is available via already established ABI.

* In a way which can only ever be used and tested by CR. If possible, kernel should provide generic mechanisms which can be used to implement features in userland. One of the reasons why we'd like to export small basic building blocks instead of full end-to-end solutions from the kernel is that we don't know how things will change in the future. In-kernel CR puts too much in the kernel in a way too inflexible manner.

* It essentially adds a separate complete set of entry/exit points for a lot of things, which makes things more error prone and increases maintenance overhead across the board.

* And, most of all, there are userland implementation and virtualization, making the benefit to overhead ratio completely off. Userland implementation _already_ achieves most of what's necessary for the most important use case of HPC without any special help from the kernel. The only reasonable thing to do is taking a good look at it and finding ways to improve it.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-18 9:48 ` Tejun Heo @ 2010-11-18 20:13 ` Jose R. Santos 2010-11-19 3:54 ` Serge Hallyn 1 sibling, 0 replies; 49+ messages in thread From: Jose R. Santos @ 2010-11-18 20:13 UTC (permalink / raw) To: Tejun Heo Cc: Pavel Emelyanov, Serge E. Hallyn, Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Matt Helsley, Linux Containers, Eric W. Biederman On Thu, 18 Nov 2010 10:48:34 +0100 Tejun Heo <tj@kernel.org> wrote: > Hello, Pavel. > > On 11/18/2010 10:13 AM, Pavel Emelyanov wrote: > >>> By this do you mean the very idea of having CR support in the > >>> kernel? Or our design of it in the kernel? > >> > >> The former, I'm afraid. > > > > Can you elaborate on this please? > > I think I already did that several times in this thread but here's an > attempt at summary. Yet the arguments seem to be vague enough not to be convincing to the people working on the code. > * It adds a bunch of pseudo ABI when most of the same information is > available via already established ABI. Can you elaborate on this? What established ABI are you proposing we use here? Hopefully we can turn this into a more technical discussion. > * In a way which can only ever be used and tested by CR. If possible, So what if it can only be tested with CR, as long as we can make CR work on a variety of environments? Scalability changes for _really_ large SMP boxes can only be reliably tested by people with such equipment. We are not imposing any such restriction, and this code can be tested on a very wide range of setups. > kernel should provide generic mechanisms which can be used to > implement features in userland. One of the reasons why we'd like to > export small basic building blocks instead of full end-to-end > solutions from the kernel is that we don't know how things will > change in the future. In-kernel CR puts too much in the kernel in a > way too inflexible manner.
> > * It essentially adds a separate complete set of entry/exit points for > a lot of things, which makes things more error prone and increases > maintenance overhead across the board. I partially agree with you here. There will be maintenance overhead every time you add code to the kernel that _may_ make changes in the future more complicated. This is true for _any_ code that is added to the core kernel. Now in my experience such a maintenance burden is most disruptive when the code being added creates a lot of new state that needs to be tracked in multiple places unrelated to CR (in this case). Our argument is that the CR code is not creating new state that will cause painful future changes to the kernel. If you have a specific example that you are concerned with, great. Let's discuss those. Are we promising zero maintenance cost? No. But guess what, neither do most features that make it into the kernel. Now, if we change the argument around... What would be the maintenance cost of keeping this outside the kernel? I would argue that it is much higher and would use SystemTap as the first example that comes to mind. > * And, most of all, there are userland implementation and > virtualization, making the benefit to overhead ratio completely off. Can we keep virtualization out of this? Every time someone mentions virtualization as a solution, it makes me feel like these people just don't understand the problem we are trying to solve. It is just not practical to create a new VM for every application you want to CR. These are two different tools to attack two different problems. > Userland implementation _already_ achieves most of what's necessary > for the most important use case of HPC without any special help from What are these _most_ important cases of HPC that you are referring to? Can we do a lot of these cases from userspace? Sure, but why are the ones that can't be done from userspace any less important? If nobody cared about those, we would not be having this conversation.
> the kernel. The only reasonable thing to do is taking a good look > at it and finding ways to improve it. The userspace vs in-kernel discussion has been done before as multiple people have already said in this thread. Show me a version of userspace CR that can correctly do all that an in-kernel implementation is capable of. > Thanks. > -- Jose R. Santos ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-18 9:48 ` Tejun Heo 2010-11-18 20:13 ` Jose R. Santos @ 2010-11-19 3:54 ` Serge Hallyn 1 sibling, 0 replies; 49+ messages in thread From: Serge Hallyn @ 2010-11-19 3:54 UTC (permalink / raw) To: Tejun Heo Cc: Pavel Emelyanov, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Eric W. Biederman, Linux Containers Quoting Tejun Heo (tj@kernel.org): > * And, most of all, there are userland implementation and > virtualization, making the benefit to overhead ratio completely off. > Userland implementation _already_ achieves most of what's necessary Guess I'll just be offensive here and say, straight-out: I don't believe it. Can I see the userspace implementation of c/r? If it's as good as the kernel level c/r, then awesome - we don't need the kernel patches. If it's not as good, then the thing is, we're not drawing arbitrary lines saying "is this good enough"; rather, we want completely reliable and transparent c/r. IOW, the running task and the other end can't tell that a migration happened, and, if checkpoint says it worked, then restart must succeed. -serge ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 15:46 ` Tejun Heo 2010-11-18 9:13 ` Pavel Emelyanov @ 2010-11-18 19:53 ` Oren Laadan 2010-11-19 4:10 ` Serge Hallyn 2 siblings, 0 replies; 49+ messages in thread From: Oren Laadan @ 2010-11-18 19:53 UTC (permalink / raw) To: Tejun Heo Cc: Serge E. Hallyn, Kapil Arya, Gene Cooperman, linux-kernel, Matt Helsley, Linux Containers, Eric W. Biederman, xemul On 11/17/2010 10:46 AM, Tejun Heo wrote: > Hello, Serge. > > On 11/17/2010 04:39 PM, Serge E. Hallyn wrote: >>> I'm sorry but in-kernel CR already looks like a major misdesign to me. >> >> By this do you mean the very idea of having CR support in the kernel? >> Or our design of it in the kernel? > > The former, I'm afraid. > >> Let's go back to July 2008, at the containers mini-summit, where it >> was unanimously agreed upon that the kernel was the right place >> (Checkpoint/Resetart [CR] under >> http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that >> we would start by supporting a single task with no resources. Was >> that whole discussion effectively misguided, in your opinion? Or do >> you feel that since the first steps outlined in that discussion >> we've either "gone too far" or strayed in the subsequent design? > > The conclusion doesn't seem like such a good idea, well, at least to > me for what it's worth. Conclusions at summits don't carry decisive > weight. It'll still have to prove its worthiness for mainline all the > same and in light of already working userland alternative and the > expanded area now covered by virtualization, the arguments in this > thread don't seem too strong. While it's your opinion that userland alternatives "already work", in reality they are unsuitable for several real use-cases. The userland approach has serious restrictions - which I will cover in a follow-up post to my discussion with Gene soon. 
Note that one important point of agreement was that DMTCP's ability to provide "glue" to restart applications without their original context is _orthogonal_ to how the core c/r is done. IOW - the exciting goodies from DMTCP are useful with either form of c/r. You also argue that "virtualization" (VMs?) covers everything else, implying that lightweight virtualization is useless. In reality it is an important technology, already in the kernel (surely you don't suggest pulling it out?!) and for a reason. That is already a very good reason to provide, e.g., containers c/r and live-migration to keep it competitive and useful. Thanks, Oren. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 15:46 ` Tejun Heo 2010-11-18 9:13 ` Pavel Emelyanov 2010-11-18 19:53 ` Oren Laadan @ 2010-11-19 4:10 ` Serge Hallyn 2010-11-19 14:04 ` Tejun Heo 2 siblings, 1 reply; 49+ messages in thread From: Serge Hallyn @ 2010-11-19 4:10 UTC (permalink / raw) To: Tejun Heo Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul, Linux Containers, Eric W. Biederman Quoting Tejun Heo (tj@kernel.org): > Hello, Serge. Hey Tejun :) > On 11/17/2010 04:39 PM, Serge E. Hallyn wrote: > >> I'm sorry but in-kernel CR already looks like a major misdesign to me. > > > > By this do you mean the very idea of having CR support in the kernel? > > Or our design of it in the kernel? > > The former, I'm afraid. > > > Let's go back to July 2008, at the containers mini-summit, where it > > was unanimously agreed upon that the kernel was the right place > > (Checkpoint/Resetart [CR] under > > http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that > > we would start by supporting a single task with no resources. Was > > that whole discussion effectively misguided, in your opinion? Or do > > you feel that since the first steps outlined in that discussion > > we've either "gone too far" or strayed in the subsequent design? > > The conclusion doesn't seem like such a good idea, well, at least to > me for what it's worth. Conclusions at summits don't carry decisive > weight. Of course. It allows us to present at kernel summit and look for early rejections to save us all some time (which we did, at the container mini-summit readout at ksummit 2008), but it would be silly to read anything more into it than that. > It'll still have to prove its worthiness for mainline all the > same 100% agreed. > and in light of already working userland alternative and the Here's where we disagree. 
If you are right about a viable userland alternative ('already working' isn't even a prereq in my opinion, so long as it is really viable), then I'm with you, but I'm not buying it at this point. Seriously. Truly. Honestly. I am *not* looking for any extra kernel work at this moment, if we can help it in any way. > expanded area now covered by virtualization, the arguments in this > thread don't seem too strong. -serge ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 4:10 ` Serge Hallyn @ 2010-11-19 14:04 ` Tejun Heo 2010-11-20 18:05 ` Oren Laadan [not found] ` <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 0 siblings, 2 replies; 49+ messages in thread From: Tejun Heo @ 2010-11-19 14:04 UTC (permalink / raw) To: Serge Hallyn Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul, Linux Containers, Eric W. Biederman On 11/19/2010 05:10 AM, Serge Hallyn wrote: > Hey Tejun :) Hey, :-) >> and in light of already working userland alternative and the > > Here's where we disagree. If you are right about a viable userland > alternative ('already working' isn't even a preqeq in my opinion, > so long as it is really viable), then I'm with you, but I'm not buying > it at this point. > > Seriously. Truly. Honestly. I am *not* looking for any extra kernel > work at this moment, if we can help it in any way. What's so wrong with Gene's work? Sure, it has some hacky aspects but let's fix those up. To me, it sure looks like much saner and manageable approach than in-kernel CR. We can add nested ptrace, CLONE_SET_PID (or whatever) in pidns, integrate it with various ns supports, add an ability to adjust brk, export inotify state via fdinfo and so on. The thing is already working, the codebase of core part is fairly small and condor is contemplating integrating it, so at least some people in HPC segment think it's already viable. Maybe the HPC cluster I'm currently sitting near is special case but people here really don't run very fancy stuff. In most cases, they're fairly simple (from system POV) C programs reading/writing data and burning a _LOT_ of CPU cycles inbetween and admins here seem to think dmtcp integrated with condor would work well enough for them. Sure, in-kernel CR has better or more reliable coverage now but by how much? The basic things are already there in userland. The tradeoff simply doesn't make any sense. 
If it were a well separated, self-sustained feature, it probably would be able to get in, but it's all over the place and requires a completely new concept - the quasi-ABI'ish binary blob which would probably be portable across different kernel versions with some massaging. I personally think the idea is fundamentally flawed (just go through the usual ABI!) but even if it were not it would require _MUCH_ stronger rationale than it currently has to be even considered for mainline inclusion. Maybe it's just me but most of the arguments for in-kernel CR look very weak. They're either about remote toy use cases or along the lines that userland CR currently doesn't do everything kernel CR does (yet). Even if it weren't for me, I frankly can't see how it would be included in mainline. I think it would be best for everyone to improve userland CR. A lot of knowledge and experience gained through kernel CR would be applicable and won't go to waste. Strong resistance against a change of direction certainly is understandable but IMHO pushing the current direction would only increase the loss. I of course could be completely wrong and might end up getting mails filled up with megabytes of "told you so" later, but, well, at this point, in-kernel CR already looks half dead to me. Thank you. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
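One item in Tejun's list, exporting inotify state via fdinfo, lends itself to a concrete picture. The sketch below assumes, purely for illustration, that each watch appears in the descriptor's fdinfo text as one line of colon-separated hexadecimal fields beginning with the word inotify. That line format and the name parse_inotify_fdinfo are assumptions, not something specified in this thread.

```python
def parse_inotify_fdinfo(text: str) -> list:
    """Extract inotify watch records from fdinfo-style text.

    Assumed line shape, one line per watch, all values in hex:
        inotify wd:3 ino:9e7e sdev:800013 mask:800afce ...
    Lines that do not start with "inotify " are ignored.
    """
    watches = []
    for line in text.splitlines():
        if not line.startswith("inotify "):
            continue
        # Turn "key:value" tokens into a dict, skipping the leading keyword.
        fields = dict(item.split(":", 1) for item in line.split()[1:])
        watches.append({
            "wd": int(fields["wd"], 16),
            "ino": int(fields["ino"], 16),
            "sdev": int(fields["sdev"], 16),
            "mask": int(fields["mask"], 16),
        })
    return watches
```

With records like these, a userspace checkpointer could save the watch list at checkpoint time and re-create it at restart without having to interpose on inotify_add_watch() at all.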
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 14:04 ` Tejun Heo @ 2010-11-20 18:05 ` Oren Laadan [not found] ` <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 1 sibling, 0 replies; 49+ messages in thread From: Oren Laadan @ 2010-11-20 18:05 UTC (permalink / raw) To: Tejun Heo Cc: Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel, xemul, Eric W. Biederman, Linux Containers Hi, Based on discussion with Gene, I'd like to clarify key points and difference between kernel and userspace approaches (specifically linux-cr and dmtcp): three parts to break the long post... part I: perpsectice about the types of scopes of c/r in discussion part II: linux-cr design adn objectives part III: comparison kernel/userspace approaches [now relax, grab (another) cup of coffee and read on...] PART I: ==PERSPECTIVE== A rough classification of c/r categories: * container-c/r: important use-case, e.g. c/r and migration of an application containers like VPS (virtual private server), VDI (desktop) or other self-contained application (e.g. Oracle server). Here _all_ the relevant processes are included in the checkpoint. * standalone-c/r: another use-case is standalone-c/r where a set of processes is checkpointed, but not the entire environment, and then those processes are restarted in a different "eco-system". * distributed-c/r: meaning several sets of processes, each running on a different host. (Each set may be a separate container there). In container-c/r, the main challenge is to be _reliable_ in the sense that a restart from a successful checkpoint should always succeed. In standalone-c/r, the main challenge is that an application resumes execution after a restart in a possible _different_ eco-system. Some application don't care (e.g 'bc'). Other applications do care, and to different degrees; for these we need "glue" to pacify the application. 
There are generally three types of "glue":

(1) Modify the application or selected libraries to be c/r-aware, and notify it when restart completes (e.g. CoCheck MPI library).

(2) Add a userspace helper that will run post-restart to do the necessary trickery (e.g. send a SIGWINCH to 'screen'; mount the proper filesystem at the new host after migration; reconnect a socket to a peer).

(3) Use interposition on selected library calls and add wrapper code that will glue in what's missing (e.g. dbus or nscd calls to reconnect an application to those services).

IMPORTANT: the glueing method is _orthogonal_ to how the c/r is done! We are strictly discussing the core c/r functionality.

(next part: linux-cr philosophy...)

Thanks,

Oren.

^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>]
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [not found] ` <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> @ 2010-11-19 14:36 ` Kirill Korotaev [not found] ` <04F4899E-B5C7-4BAF-8F2F-05D507A91408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2010-11-20 18:08 ` Oren Laadan 2010-11-20 18:11 ` Oren Laadan 2 siblings, 1 reply; 49+ messages in thread From: Kirill Korotaev @ 2010-11-19 14:36 UTC (permalink / raw) To: Tejun Heo Cc: Kapil Arya, Pavel Emelianov, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Eric W. Biederman, Linux Containers

Tejun,

Sorry for getting into the middle of the discussion, but...

Can you imagine how many userland APIs are needed to make userspace C/R?

Do you really want APIs in user-space which allow to:
- send signals with siginfo attached (kill() doesn't work...)
- read inotify configuration
- insert SKB's into socket buffers
- setup all TCP/IP parameters for sockets
- wait for AIO pending in other processes
- setting different statistics counters (like netdev stats etc.)
and so on...

For every small piece of functionality you will need to export ABI and maintain it forever. It's thousands of APIs! And why the hell are they needed in user space at all?

BTW, the HPC case you are talking about is probably the simplest one. Last time I looked into it, IBM Meiosis c/r didn't even bother with tty's migration. In OpenVZ we really do need much more than that, like autofs/NFS support, preserve statistics, TTYs, etc. etc. etc.

Thanks,
Kirill

On Nov 19, 2010, at 17:04 , Tejun Heo wrote:

> On 11/19/2010 05:10 AM, Serge Hallyn wrote:
>> Hey Tejun :)
>
> Hey, :-)
>
>>> and in light of already working userland alternative and the
>>
>> Here's where we disagree. If you are right about a viable userland
>> alternative ('already working' isn't even a prereq in my opinion,
>> so long as it is really viable), then I'm with you, but I'm not buying
>> it at this point.
>>
>> Seriously. Truly. Honestly. I am *not* looking for any extra kernel
>> work at this moment, if we can help it in any way.
>
> What's so wrong with Gene's work? Sure, it has some hacky aspects but
> let's fix those up. To me, it sure looks like much saner and
> manageable approach than in-kernel CR. We can add nested ptrace,
> CLONE_SET_PID (or whatever) in pidns, integrate it with various ns
> supports, add an ability to adjust brk, export inotify state via
> fdinfo and so on.
>
> The thing is already working, the codebase of core part is fairly
> small and condor is contemplating integrating it, so at least some
> people in HPC segment think it's already viable. Maybe the HPC
> cluster I'm currently sitting near is special case but people here
> really don't run very fancy stuff. In most cases, they're fairly
> simple (from system POV) C programs reading/writing data and burning a
> _LOT_ of CPU cycles in between and admins here seem to think dmtcp
> integrated with condor would work well enough for them.
>
> Sure, in-kernel CR has better or more reliable coverage now but by how
> much? The basic things are already there in userland. The tradeoff
> simply doesn't make any sense. If it were a well separated self
> sustained feature, it probably would be able to get in, but it's all
> over the place and requires a completely new concept - the
> quasi-ABI'ish binary blob which would probably be portable across
> different kernel versions with some massaging. I personally think the
> idea is fundamentally flawed (just go through the usual ABI!) but even
> if it were not it would require _MUCH_ stronger rationale than it
> currently has to be even considered for mainline inclusion.
>
> Maybe it's just me but most of the arguments for in-kernel CR look
> very weak. They're either about remote toy use cases or along the
> line that userland CR currently doesn't do everything kernel CR does
> (yet). Even if it weren't for me, I frankly can't see how it would be
> included in mainline.
>
> I think it would be best for everyone to improve userland CR. A lot
> of knowledge and experience gained through kernel CR would be
> applicable and won't go wasted. Strong resistance against direction
> change certainly is understandable but IMHO pushing the current
> direction would only increase loss. I of course could be completely
> wrong and might end up getting mails filled up with megabytes of "told
> you so" later, but, well, at this point, in-kernel CR already looks
> half dead to me.
>
> Thank you.
>
> --
> tejun

^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <04F4899E-B5C7-4BAF-8F2F-05D507A91408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>]
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [not found] ` <04F4899E-B5C7-4BAF-8F2F-05D507A91408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2010-11-19 15:33 ` Tejun Heo 2010-11-19 16:00 ` Alexey Dobriyan 2010-11-20 17:58 ` Oren Laadan 0 siblings, 2 replies; 49+ messages in thread From: Tejun Heo @ 2010-11-19 15:33 UTC (permalink / raw) To: Kirill Korotaev Cc: Kapil Arya, Pavel Emelianov, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Eric W. Biederman, Linux Containers

Hello,

On 11/19/2010 03:36 PM, Kirill Korotaev wrote:
> Can you imagine how many userland APIs are needed to make userspace C/R?
>
> Do you really want APIs in user-space which allow to:
> - send signals with siginfo attached (kill() doesn't work...)

Doesn't rt_sigqueueinfo() already do this?

> - read inotify configuration

This would be nice even apart from CR.

> - insert SKB's into socket buffers

Can't we drain kernel buffers? ie. Stop further writing and wait the send-q to drop to zero.

> - setup all TCP/IP parameters for sockets

I _think_ most can be restored by talking to netfilter module. Setting outgoing sequence number might be beneficial tho.

> - wait for AIO pending in other processes

I haven't looked at aio implementation for a while now but can't we drain these upon checkpointing and just carry the completion status? Also, if aio is what you're concerned about, I would say the problem is mostly solved.

> - setting different statistics counters (like netdev stats etc.)
> and so on...

Why would this matter?

> For every small piece of functionality you will need to export ABI
> and maintain it forever. It's thousands of APIs! And why the hell
> they are needed in user space at all?

I think it's actually quite the contrary. Most things are already visible to userland. They _have_ to be and that's the reason why userland implementation can already get most things working without any change to the kernel with some amount of hackery. To me in-kernel CR seems to approach the problem from exactly the wrong direction - rather than dealing with specific exceptions, it creates a completely new framework which is very foreign and not useful outside of CR.

Also, think about it. Which one is better? A kernel which can fully show its ABI visible states to userland or one which dumps its internal data structures in binary blobs. To me, the latter seems multiple orders of magnitude uglier.

> BTW, HPC case you are talking about is probably the simplest
> one.

Yet, it is one of the most important / relevant use cases.

> Last time I looked into it, IBM Meiosis c/r didn't even bother with
> tty's migration. In OpenVZ we really do need much more than that
> like autofs/NFS support, preserve statistics, TTYs, etc. etc. etc.

Would it be impossible to preserve autofs/NFS and TTYs from userland? If so, why? For statistics, I'm a bit lost. Why does it matter and even if it does would it justify putting the whole CR inside kernel?

Thank you.

--
tejun

^ permalink raw reply [flat|nested] 49+ messages in thread
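For readers following the siginfo point in the reply above: sigqueue() is the glibc wrapper over the rt_sigqueueinfo() syscall, and it lets a sender attach a payload that the receiver sees in siginfo. The following is only a minimal sketch and not code from the thread. The function name queue_and_fetch_value is hypothetical, and the example only shows a process queuing SIGUSR1 to itself and reading the payload back. Whether this covers every pending-signal state a restart would need is exactly what the thread disputes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: queue SIGUSR1 to the calling process with an
 * integer payload, dequeue it synchronously, and return the payload
 * found in siginfo. Returns -1 on failure, so -1 is not a valid payload.
 * Intended for a single-threaded caller (it uses sigprocmask()).
 */
int queue_and_fetch_value(int payload)
{
    sigset_t set, old;
    siginfo_t info;
    union sigval value;
    struct timespec timeout = { .tv_sec = 1, .tv_nsec = 0 };
    int result = -1;

    assert(payload != -1);

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);

    /* Block the signal so it stays pending instead of being delivered. */
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    value.sival_int = payload;
    if (sigqueue(getpid(), SIGUSR1, value) == 0 &&
        sigtimedwait(&set, &info, &timeout) == SIGUSR1 &&
        info.si_code == SI_QUEUE)
        result = info.si_value.sival_int;

    /* Restore the caller's original signal mask. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```

Calling queue_and_fetch_value(42) from a small test program should hand back the same integer that was attached by the sender, which is the property kill() alone cannot provide.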
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 15:33 ` Tejun Heo @ 2010-11-19 16:00 ` Alexey Dobriyan 2010-11-19 16:01 ` Alexey Dobriyan 2010-11-19 16:06 ` Tejun Heo 2010-11-20 17:58 ` Oren Laadan 1 sibling, 2 replies; 49+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:00 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On Fri, Nov 19, 2010 at 5:33 PM, Tejun Heo <tj@kernel.org> wrote: >> - insert SKB's into socket buffers > > Can't we drain kernel buffers? ie. Stop further writing and wait the > send-q to drop to zero. On send: if network dies right after freeze, you lose. On receive: packets arrive after process freeze, but before network device freeze. >> - setting different statistics counters (like netdev stats etc.) >> and so on... > > Why would this matter? Because you'll introduce million stupid interfaces not interesting to anyone but C/R. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:00 ` Alexey Dobriyan @ 2010-11-19 16:01 ` Alexey Dobriyan 2010-11-19 16:10 ` Tejun Heo 2010-11-19 16:06 ` Tejun Heo 1 sibling, 1 reply; 49+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:01 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On Fri, Nov 19, 2010 at 6:00 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote: >>> - setting different statistics counters (like netdev stats etc.) >>> and so on... >> >> Why would this matter? > > Because you'll introduce million stupid interfaces not interesting to > anyone but C/R. Just like CLONE_SET_PID. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:01 ` Alexey Dobriyan @ 2010-11-19 16:10 ` Tejun Heo 2010-11-19 16:25 ` Alexey Dobriyan 0 siblings, 1 reply; 49+ messages in thread From: Tejun Heo @ 2010-11-19 16:10 UTC (permalink / raw) To: Alexey Dobriyan Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On 11/19/2010 05:01 PM, Alexey Dobriyan wrote: > On Fri, Nov 19, 2010 at 6:00 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote: >>>> - setting different statistics counters (like netdev stats etc.) >>>> and so on... >>> >>> Why would this matter? >> >> Because you'll introduce million stupid interfaces not interesting to >> anyone but C/R. > > Just like CLONE_SET_PID. Well, if you ask me, having pidns w/o a way to reinstate PID from userland is pretty silly and you and I might not know yet but it's quite imaginable that there will be other use cases for the capability unlike in-kernel CR. Kernel provides building blocks not the whole frigging package and for very good reasons. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:10 ` Tejun Heo @ 2010-11-19 16:25 ` Alexey Dobriyan 0 siblings, 0 replies; 49+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:25 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers

On Fri, Nov 19, 2010 at 6:10 PM, Tejun Heo <tj@kernel.org> wrote:
> Well, if you ask me, having pidns w/o a way to reinstate PID from
> userland is pretty silly

No. Chrome uses CLONE_NEWPID so that an exploit couldn't attach to processes in the parent pidns.

> and you and I might not know yet but it's
> quite imaginable that there will be other use cases for the capability
> unlike in-kernel CR. Kernel provides building blocks not the whole
> frigging package and for very good reasons.

Speaking of pids, a pid's value itself is never interesting (except maybe pid 1). It's a cookie. CLONE_SET_PID came up only now because only C/R wants it.

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:00 ` Alexey Dobriyan 2010-11-19 16:01 ` Alexey Dobriyan @ 2010-11-19 16:06 ` Tejun Heo 2010-11-19 16:16 ` Alexey Dobriyan 1 sibling, 1 reply; 49+ messages in thread From: Tejun Heo @ 2010-11-19 16:06 UTC (permalink / raw) To: Alexey Dobriyan Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers

Hello,

On 11/19/2010 05:00 PM, Alexey Dobriyan wrote:
> On Fri, Nov 19, 2010 at 5:33 PM, Tejun Heo <tj@kernel.org> wrote:
>>> - insert SKB's into socket buffers
>>
>> Can't we drain kernel buffers? ie. Stop further writing and wait the
>> send-q to drop to zero.
>
> On send:
> if network dies right after freeze, you lose.

Gosh, if you're really worried about that, put in a netfilter module which would buffer and simulate acks to extract the packets before initiating freeze. These are fringe problems. Use fringe solutions.

> On receive:
> packets arrive after process freeze, but before network device freeze.

Just store the data somewhere. The checkpointer can drain the socket, right?

>>> - setting different statistics counters (like netdev stats etc.)
>>> and so on...
>>
>> Why would this matter?
>
> Because you'll introduce million stupid interfaces not interesting to
> anyone but C/R.

In this thread, how many have you guys come up with? Not even a dozen and most can be solved almost trivially. Seriously, what the hell..

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 49+ messages in thread
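The "just store the data somewhere" suggestion above can be pictured as a non-blocking read loop over the socket's receive queue. The following is a rough sketch under stated assumptions, not anything proposed in the thread: drain_socket is a hypothetical name, the caller is assumed to hold a descriptor for the socket, and the sketch says nothing about the ordering problem raised in the quoted objection (data arriving after the processes are frozen) or about how the saved bytes would be fed back on restart.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: copy whatever is currently queued on the receive
 * side of `fd` into `buf` without blocking. Returns the number of bytes
 * saved, or -1 on a real error. Stops when the queue is empty, the peer
 * has closed the connection, or `buf` is full.
 */
ssize_t drain_socket(int fd, char *buf, size_t cap)
{
    size_t saved = 0;

    assert(buf != NULL);

    while (saved < cap) {
        ssize_t n = recv(fd, buf + saved, cap - saved, MSG_DONTWAIT);

        if (n > 0) {
            saved += (size_t)n;          /* got data, keep going */
        } else if (n == 0) {
            break;                       /* peer closed the connection */
        } else if (errno == EINTR) {
            continue;                    /* interrupted, retry */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                       /* receive queue is empty */
        } else {
            return -1;                   /* genuine error */
        }
    }
    return (ssize_t)saved;
}
```

Note that recv() consumes the data, so a checkpointer using this approach would have to store the bytes in its image and replay them to the application after restart; using MSG_PEEK instead would leave the queue untouched.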
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:06 ` Tejun Heo @ 2010-11-19 16:16 ` Alexey Dobriyan 2010-11-19 16:19 ` Tejun Heo 0 siblings, 1 reply; 49+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:16 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On Fri, Nov 19, 2010 at 6:06 PM, Tejun Heo <tj@kernel.org> wrote: >>>> - setting different statistics counters (like netdev stats etc.) >>>> and so on... >>> >>> Why would this matter? >> >> Because you'll introduce million stupid interfaces not interesting to >> anyone but C/R. > > In this thread, how many have you guys come up with? Not even a dozen > and most can be sovled almost trivially. Seriously, what the hell.. I do not count them. The paragon of absurdity is struct task_struct::did_exec . ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:16 ` Alexey Dobriyan @ 2010-11-19 16:19 ` Tejun Heo 2010-11-19 16:27 ` Alexey Dobriyan 0 siblings, 1 reply; 49+ messages in thread From: Tejun Heo @ 2010-11-19 16:19 UTC (permalink / raw) To: Alexey Dobriyan Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On 11/19/2010 05:16 PM, Alexey Dobriyan wrote: > On Fri, Nov 19, 2010 at 6:06 PM, Tejun Heo <tj@kernel.org> wrote: >>>>> - setting different statistics counters (like netdev stats etc.) >>>>> and so on... >>>> >>>> Why would this matter? >>> >>> Because you'll introduce million stupid interfaces not interesting to >>> anyone but C/R. >> >> In this thread, how many have you guys come up with? Not even a dozen >> and most can be sovled almost trivially. Seriously, what the hell.. > > I do not count them. > > The paragon of absurdity is struct task_struct::did_exec . Yeah, then go and figure how to do that in a way which would be useful for other purposes too instead of trying to shove the whole checkpointer inside the kernel. It sure would be harder but hey that's the way it is. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:19 ` Tejun Heo @ 2010-11-19 16:27 ` Alexey Dobriyan [not found] ` <AANLkTin7kd3crS+fTLLea5PhAii7B3dz=n7p7YtQ6d4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 49+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:27 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers

On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote:
>> The paragon of absurdity is struct task_struct::did_exec .
>
> Yeah, then go and figure how to do that in a way which would be useful
> for other purposes too instead of trying to shove the whole
> checkpointer inside the kernel. It sure would be harder but hey
> that's the way it is.

System call for one bit? This is ridiculous.

Doing execve(2) for userspace C/R is ridiculous too (and likely doesn't work).

^ permalink raw reply [flat|nested] 49+ messages in thread
[parent not found: <AANLkTin7kd3crS+fTLLea5PhAii7B3dz=n7p7YtQ6d4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [not found] ` <AANLkTin7kd3crS+fTLLea5PhAii7B3dz=n7p7YtQ6d4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2010-11-19 16:32 ` Tejun Heo 2010-11-19 16:38 ` Alexey Dobriyan 0 siblings, 1 reply; 49+ messages in thread From: Tejun Heo @ 2010-11-19 16:32 UTC (permalink / raw) To: Alexey Dobriyan Cc: Kapil Arya, Kirill Korotaev, Pavel Emelianov, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Eric W. Biederman, Linux Containers On 11/19/2010 05:27 PM, Alexey Dobriyan wrote: > On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: >>> The paragon of absurdity is struct task_struct::did_exec . >> >> Yeah, then go and figure how to do that in a way which would be useful >> for other purposes too instead of trying to shove the whole >> checkpointer inside the kernel. It sure would be harder but hey >> that's the way it is. > > System call for one bit? This is ridiculous. Why not just a flag in proc entry? It's a frigging single bit. > Doing execve(2) for userspace C/R is ridicoulous too (and likely > doesn't work). Really, whatever. Just keep doing what you're doing. Hey, if it makes you happy, it can't be too wrong. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:32 ` Tejun Heo @ 2010-11-19 16:38 ` Alexey Dobriyan 2010-11-19 16:50 ` Tejun Heo 0 siblings, 1 reply; 49+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:38 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers

On Fri, Nov 19, 2010 at 6:32 PM, Tejun Heo <tj@kernel.org> wrote:
> On 11/19/2010 05:27 PM, Alexey Dobriyan wrote:
>> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote:
>>>> The paragon of absurdity is struct task_struct::did_exec .
>>>
>>> Yeah, then go and figure how to do that in a way which would be useful
>>> for other purposes too instead of trying to shove the whole
>>> checkpointer inside the kernel. It sure would be harder but hey
>>> that's the way it is.
>>
>> System call for one bit? This is ridiculous.
>
> Why not just a flag in proc entry? It's a frigging single bit.

Because /proc/*/did_exec is useless to anyone but C/R (even for reading!).

Because the code is much simpler:

    tsk->did_exec = !!tsk_img->did_exec;
+
    __u8 did_exec;

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:38 ` Alexey Dobriyan @ 2010-11-19 16:50 ` Tejun Heo 2010-11-19 16:55 ` Alexey Dobriyan 0 siblings, 1 reply; 49+ messages in thread From: Tejun Heo @ 2010-11-19 16:50 UTC (permalink / raw) To: Alexey Dobriyan Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On 11/19/2010 05:38 PM, Alexey Dobriyan wrote: > On Fri, Nov 19, 2010 at 6:32 PM, Tejun Heo <tj@kernel.org> wrote: >> On 11/19/2010 05:27 PM, Alexey Dobriyan wrote: >>> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote: >>>>> The paragon of absurdity is struct task_struct::did_exec . >>>> >>>> Yeah, then go and figure how to do that in a way which would be useful >>>> for other purposes too instead of trying to shove the whole >>>> checkpointer inside the kernel. It sure would be harder but hey >>>> that's the way it is. >>> >>> System call for one bit? This is ridiculous. >> >> Why not just a flag in proc entry? It's a frigging single bit. > > Because /proc/*/did_exec useless to anyone but C/R (even for reading!). I don't think you'll need a full file. Just shove it in status or somewhere. Your argument is completely absurd. So, because exporting single bit is so horrible to everyone else, you want to shove the whole frigging checkpointer inside the kernel? > Because code is much simpler: > > tsk->did_exec = !!tsk_img->did_exec; > + > __u8 did_exec; Sigh, yeah, except for the horror show to create tsk_img. Your "paragon of absurdity" is did_exec which is only ever used to decide whether setpgid() should fail with -EACCES, seriously? Here's a thought. Ignore it for now and concentrate on more relevant problems. I'm fairly sure CR'd program malfunctioning over did_exec wouldn't mark the beginning of the end of our civilization. You gotta be kidding me. -- tejun ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 16:50 ` Tejun Heo @ 2010-11-19 16:55 ` Alexey Dobriyan 0 siblings, 0 replies; 49+ messages in thread From: Alexey Dobriyan @ 2010-11-19 16:55 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman, Linux Containers On Fri, Nov 19, 2010 at 6:50 PM, Tejun Heo <tj@kernel.org> wrote: > On 11/19/2010 05:38 PM, Alexey Dobriyan wrote: >> On Fri, Nov 19, 2010 at 6:32 PM, Tejun Heo <tj@kernel.org> wrote: >>> On 11/19/2010 05:27 PM, Alexey Dobriyan wrote: >>>> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote: >>>>>> The paragon of absurdity is struct task_struct::did_exec . >>>>> >>>>> Yeah, then go and figure how to do that in a way which would be useful >>>>> for other purposes too instead of trying to shove the whole >>>>> checkpointer inside the kernel. It sure would be harder but hey >>>>> that's the way it is. >>>> >>>> System call for one bit? This is ridiculous. >>> >>> Why not just a flag in proc entry? It's a frigging single bit. >> >> Because /proc/*/did_exec useless to anyone but C/R (even for reading!). > > I don't think you'll need a full file. Just shove it in status or > somewhere. Your argument is completely absurd. So, because exporting > single bit is so horrible to everyone else, you want to shove the > whole frigging checkpointer inside the kernel? > >> Because code is much simpler: >> >> tsk->did_exec = !!tsk_img->did_exec; >> + >> __u8 did_exec; > > Sigh, yeah, except for the horror show to create tsk_img. task_struct image work is common for both userspace C/R and in-kernel. You _have_ to define it. Simpler code is only first line. > Your "paragon of absurdity" is did_exec which is only ever used > to decide whether setpgid() should fail with -EACCES, seriously? > Here's a thought. Ignore it for now and concentrate on more > relevant problems. 
You're so newjerseyly now. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-19 15:33 ` Tejun Heo 2010-11-19 16:00 ` Alexey Dobriyan @ 2010-11-20 17:58 ` Oren Laadan 1 sibling, 0 replies; 49+ messages in thread From: Oren Laadan @ 2010-11-20 17:58 UTC (permalink / raw) To: Tejun Heo Cc: Kirill Korotaev, Kapil Arya, Pavel Emelianov, Gene Cooperman, linux-kernel@vger.kernel.org, Eric W. Biederman, Linux Containers

On Fri, 19 Nov 2010, Tejun Heo wrote:

> Hello,
>
> On 11/19/2010 03:36 PM, Kirill Korotaev wrote:
> > Can you imagine how many userland APIs are needed to make userspace C/R?
> >
> > Do you really want APIs in user-space which allow to:
> > - send signals with siginfo attached (kill() doesn't work...)
>
> Doesn't rt_sigqueueinfo() already do this?
>

You assume that c/r is done by the checkpointed processes _themselves_, that is, that to checkpoint a process that process needs to be made runnable so that it can save its own state (which is the model of dmtcp, but not of using ptrace).

This model is restrictive: it requires that you hijack the execution of that process somehow and make it run. What if the process isn't runnable (e.g. in vfork waiting for completion, or ptraced deep in the kernel)? Letting it run even just a bit may modify its state. It also means that if you have many processes in the checkpointed session, e.g. 1000, then _all_ of them will have to be scheduled to run!

With kernel c/r this is unnecessary: you can use an auxiliary process to checkpoint other processes without scheduling the other processes. I.e. it's _transparent_ and _preemptive_.

Another advantage is that if anything fails during checkpoint (for whatever reason), there are no side-effects (which is not the case with the other method).

> > For every small piece of functionality you will need to export ABI
> > and maintain it forever. It's thousands of APIs! And why the hell
> > they are needed in user space at all?
>
> I think it's actually quite the contrary. Most things are already
> visible to userland. They _have_ to be and that's the reason why
> userland implementation can already get most things working without
> any change to the kernel with some amount of hackery. To me in-kernel
> CR seems to approach the problem from the exactly wrong direction -
> rather than dealing with specific exceptions, it create a completely
> new framework which is very foreign and not useful outside of CR.
>
> Also, think about it. Which one is better? A kernel which can fully
> show its ABI visible states to userland or one which dumps its
> internal data structures in binary blobs. To me, the latter seems
> multiple orders of magnitude uglier.

Are we judging aesthetics? To me the former looks uglier...

The amount of fragile hacks you need to go through to make it work in userspace for the generic cases (including userspace trickery and new crazy APIs from the kernel for state that was never even an ABI, like skb's), and the restrictions it imposes, simply suggest that userspace is not the right place to do it.

Thanks,

Oren.

^ permalink raw reply [flat|nested] 49+ messages in thread
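As a small illustration of state that an external process can already observe without scheduling the target, the sketch below reads the one-letter task state from /proc/<pid>/stat. It is not taken from either linux-cr or DMTCP. The function name task_state is hypothetical, and the example assumes /proc is mounted and readable. It shows only observation: it does not save or restore anything, so it does not settle the question of whether a non-runnable task (for example one blocked in vfork()) can be checkpointed from userspace.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>   /* getpid(), used by callers inspecting themselves */

/*
 * Hypothetical helper: return the one-letter state of `pid` as reported
 * in /proc/<pid>/stat ('R' running, 'S' sleeping, 'D' uninterruptible
 * sleep, 'T' stopped, 'Z' zombie, ...), or '?' if it cannot be read.
 */
char task_state(pid_t pid)
{
    char path[64];
    char buf[512];
    char *end_of_comm;
    size_t len;
    FILE *f;

    assert(pid > 0);

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return '?';
    len = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[len] = '\0';

    /*
     * The second field (comm) is wrapped in parentheses and may itself
     * contain spaces or ')', so locate its end by searching from the right.
     * The state letter is the next field after ") ".
     */
    end_of_comm = strrchr(buf, ')');
    if (end_of_comm == NULL || end_of_comm[1] != ' ' || end_of_comm[2] == '\0')
        return '?';
    return end_of_comm[2];
}
```

A caller could use task_state(getpid()) as a sanity check, or pass the pid of another task to see whether it is stopped or sleeping before deciding how to proceed.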
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [not found] ` <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2010-11-19 14:36 ` Kirill Korotaev @ 2010-11-20 18:08 ` Oren Laadan 2010-11-20 18:11 ` Oren Laadan 2 siblings, 0 replies; 49+ messages in thread From: Oren Laadan @ 2010-11-20 18:08 UTC (permalink / raw) To: Tejun Heo Cc: Kapil Arya, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA, xemul-3ImXcnM4P+0, Eric W. Biederman, Linux Containers

Hi,

[continuation of posting regarding kernel vs userspace approach]

part I: perspective on the types and scopes of c/r under discussion
part II: linux-cr design and objectives
part III: comparison of kernel/userspace approaches

PART II: ==PHILOSOPHY==

Linux-cr is a _generic_ c/r-engine with multiple capabilities. It can checkpoint a full container, a process hierarchy, or a single process. For containers, it provides guarantees like restart-ability; for the others, it provides the flexibility so that c/r-aware applications, libraries, helpers, and wrappers can glue what they wish to glue.
1) Transparent - completely transparent for container-c/r, and largely so for standalone-c/r ("largely" - as in except for the glue which is needed due to loss of eco-system, not due to restarting).

2) Reliable - if checkpoint succeeds then restart is guaranteed to succeed too (for container-c/r).

3) Preemptive - works without requiring that checkpointed processes be scheduled to run (and thus "collaborate").

4) Complete - covers all visible and hidden state in the kernel about processes (even if not directly visible to userspace).

5) Efficient - can be optimized along multiple axes: _zero_ impact on runtime, low downtime during checkpoint, partial and incremental checkpoint, live-migration, etc.

6) Flexible - can integrate nicely with different userspace "glueing" methods.

7) Maintainable - a small part of the code is to refactor kernel code so that it can be reused in restart; the rest is new code that in our experience rarely changes. The same holds for the image format.

What linux-cr _does not_ do in the kernel, nor plans to support, is:

1) Hardware devices: their state is per-device/vendor. Instead one should use virtual devices (VNC for display, pulseaudio for sound, screen for ttys), or have a userspace glue to restore the state of the device. That said, in the future vendors may opt to provide logic for c/r in drivers, e.g. ->checkpoint, ->restart methods.

2) Userspace glue: (as defined for standalone-c/r above) the kernel knows about processes and their state, not about their intentions. We leave that for userspace.

3) External dependencies: (outside of the local host) the kernel does not control what's outside the host. That is the responsibility of userspace. (Even with live-migration, linux-cr only restores the local state of the TCP connections).

Oren.

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [not found] ` <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> 2010-11-19 14:36 ` Kirill Korotaev 2010-11-20 18:08 ` Oren Laadan @ 2010-11-20 18:11 ` Oren Laadan [not found] ` <4CE69B8C.6050606-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 2 siblings, 1 reply; 49+ messages in thread From: Oren Laadan @ 2010-11-20 18:11 UTC (permalink / raw) To: Tejun Heo Cc: Kapil Arya, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA, xemul-3ImXcnM4P+0, Eric W. Biederman, Linux Containers

Hi,

[continuation of discussion of kernel vs userspace c/r approach]

part I: perspective about the types of scopes of c/r in discussion
part II: linux-cr design and objectives
part III: comparison kernel/userspace approaches

PART III: ==SOME TECHNICAL ASPECTS==

Important to know about userspace (DMTCP example) before presenting a comparison between kernel and userspace approaches: DMTCP has two components: 1) c/r-engine to save/restore process state, and 2) glue to restart processes out of their original context. They are _orthogonal_: the glue can be used with other c/r-engines, like linux-cr. This discussion refers to the c/r-engine _only_.

Focusing on the c/r-engine of DMTCP - it uses syscall interposition for three reasons:

1) To take control of processes at checkpoint
2) To always track state of resources not visible to userspace
3) To virtualize identifiers after restart

#1 is needed because processes save their own state (and need to run the checkpoint code for that). #2 is needed because the kernel does not expose all state, and #3 is needed because the kernel does not give ways to restore all state. So the logic for #2 and #3 mirrors in userspace functionality that already exists in the kernel.

The main advantages of the approach: (a) portability to other systems (like BSD), though with considerable effort (b) it's "good enough" for several use-cases, without kernel changes.
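As a rough illustration of reasons #2 and #3 above, the sketch below wraps a few calls in Python and keeps a userspace mirror of what they did. It is not DMTCP's code (DMTCP interposes on libc calls in C); the class name InterposedState and its methods are hypothetical and exist only for this example.

```python
import os


class InterposedState:
    """Hypothetical userspace mirror of state the kernel does not expose."""

    def __init__(self):
        self.pipes = {}    # fd -> (role, peer fd), recorded at creation time
        self.pid_map = {}  # original (pre-checkpoint) pid -> current pid

    def pipe(self):
        # Wrapper around the real call: do the work, then record what was made.
        r, w = os.pipe()
        self.pipes[r] = ("read", w)
        self.pipes[w] = ("write", r)
        return r, w

    def close(self, fd):
        os.close(fd)
        self.pipes.pop(fd, None)

    def note_restart(self, original_pid):
        # After restart the kernel assigns a new pid; remember the old one.
        self.pid_map[original_pid] = os.getpid()

    def getpid(self):
        # Report the pid the application saw before the checkpoint, if any.
        current = os.getpid()
        for original, mapped in self.pid_map.items():
            if mapped == current:
                return original
        return current


state = InterposedState()
r, w = state.pipe()
state.close(w)
```

Every wrapped call pays for the bookkeeping even when no checkpoint is ever taken, which is the runtime cost debated in the rest of the thread.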
Putting the c/r-engine in the kernel provides many advantages, which I summarize in the following table:

category        linux-cr                        userspace
--------------------------------------------------------------------------------
PERFORMANCE     has _zero_ runtime overhead     visible overhead due to syscalls
                                                interposition and state tracking
                                                even w/o checkpoints;

OPTIMIZATIONS   many optimizations possible     limited, less effective
                only in kernel, for downtime,   w/ much larger overhead.
                image size, live-migration

OPERATION       applications run unmodified     to do c/r, needs 'controller'
                                                task (launch and manage _entire_
                                                execution) - point of failure.
                                                restricts how a system is used.

PREEMPTIVE      checkpoint at any time, use     processes must be runnable and
                auxiliary task to save state;   "collaborate" for checkpoint;
                non-intrusive: failure does     long task coordination time
                not impact checkpointees.       with many tasks/threads. alters
                                                state of checkpointee if fails.
                                                e.g. cannot checkpoint when in
                                                vfork(), ptrace states, etc.

COVERAGE        save/restore _all_ task state;  needs new ABI for everything:
                identify shared resources; can  expose state, provide means to
                extend for new kernel features  restore state (e.g. TCP protocol
                easily                          options negotiated with peers)

RELIABILITY     checkpoint w/ single syscall;   non-atomic, cannot find leaks
                atomic operation. guaranteed    to determine restartability
                restartability for containers

USERSPACE GLUE  possible                        possible

SECURITY        root and non-root modes         root and non-root modes
                native support for LSM

MAINTENANCE     changes mainly for features     changes mainly for features;
                                                create new ABI for features

I'm not saying Gene's work isn't good - on the contrary, it's a fine piece of engineering. However, the part of it that does c/r poses many constraints that limit the generality, mode of use, and performance of the whole. That may be enough for Tejun, for your cluster. But not for other users of the technology.
And by all means, I intend to cooperate with Gene to see how to make the other part of DMTCP, namely the userspace "glue", work on top of linux-cr to have the benefits of all worlds!

All in all, kernel c/r is far more generic and less restrictive than userspace, can provide nice guarantees, and has superior performance. It can do everything that a userspace c/r can do, and much more - and that "much more" is crucial for important use cases.

Last word about maintenance - once the core code is in mainline (which means a code "spike"), experience (both kernel/userspace) shows that both code and image format hardly change. The format is tied to the specific set of features supported (i.e. kernel versions) so that the kernel does not need to maintain backward compatibility.

Thanks,

Oren
[parent not found: <4CE69B8C.6050606-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch [not found] ` <4CE69B8C.6050606-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2010-11-20 18:15 ` Oren Laadan 2010-11-20 19:33 ` Tejun Heo 0 siblings, 1 reply; 49+ messages in thread From: Oren Laadan @ 2010-11-20 18:15 UTC (permalink / raw) To: Tejun Heo Cc: Kapil Arya, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA, xemul-3ImXcnM4P+0, Eric W. Biederman, Linux Containers

[[apologies for the silly prefix on last two posts - a combination of windows, putty, pine and slow connection is not helping me :( ]]
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-20 18:15 ` Oren Laadan @ 2010-11-20 19:33 ` Tejun Heo 2010-11-21 8:18 ` Gene Cooperman 2010-11-22 17:18 ` Oren Laadan 0 siblings, 2 replies; 49+ messages in thread From: Tejun Heo @ 2010-11-20 19:33 UTC (permalink / raw) To: Oren Laadan Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul, Eric W. Biederman, Linux Containers

Hello,

On 11/20/2010 07:15 PM, Oren Laadan wrote:
>
> [[apologies for the silly prefix on last two posts - a combination
> of windows, putty, pine and slow connection is not helping me :( ]]

Maybe it's a good idea to post a clean concatenated version for later reference?

Thanks.

-- tejun
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-20 19:33 ` Tejun Heo @ 2010-11-21 8:18 ` Gene Cooperman 2010-11-21 8:21 ` Gene Cooperman ` (2 more replies) 2010-11-22 17:18 ` Oren Laadan 1 sibling, 3 replies; 49+ messages in thread From: Gene Cooperman @ 2010-11-21 8:18 UTC (permalink / raw) To: Tejun Heo Cc: Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel, xemul, Eric W. Biederman, Linux Containers In this post, Kapil and I will provide our own summary of how we see the issues for discussion so far. In the next post, we'll reply specifically to comment on Oren's table of comparison between linux-cr and userspace. In general, we'd like to add that the conversation with Oren was very useful for us, and I think Oren will also agree that we were able to converge on the purely technical questions. Concerning opinions, we want to be cautious on opinions, since we're still learning the context of this ongoing discussion on LKML. There is probably still some context that we're missing. Below, we'll summarize the four major questions that we've understood from this discussion so far. But before doing so, I want to point out that a single process or process tree will always have many possible interactions with the rest of the world. Within our own group, we have an internal slogan: "You can't checkpoint the world." A virtual machine can have a relatively closed world, which makes it more robust, but checkpointing will always have some fragile parts. We give four examples below: a. time virtualization b. external database c. NSCD daemon d. screen and other full-screen text programs These are not the only examples of difficult interactions with the rest of the world. Anyway, in my opinion, the conversation with Oren seemed to converge into two larger cases: 1. In a pure userland C/R like DMTCP, how many corner cases are not handled, or could not be handled, in a pure userland approach? Also, how important are those corner cases? 
Do some have important use cases that rise above just a corner case? [ inotify is one of those examples. For DMTCP to support this, it would have to put wrappers around inotify_add_watch, inotify_rm_watch, read, etc., and maybe even tracking inodes in case the file had been renamed after the inotify_add_watch. Something could be made to work for the common cases, but it would still be a hack --- to be done only if a use case demands it. ] 2. In a Linux C/R approach, it's already recognized that one needs a userland component (for example, for convenience of recreating the process tree on restart). How many other cases are there that require a userland component? [ One example here is the shared memory segment of NSCD, which has to be re-initialized on restart. Another example is a screen process that talks to an ANSI terminal emulator (e.g. gnome-terminal), which talks to an X server or VNC server. Below, we discuss these examples in more detail. ] One can add a third and fourth question here: 3. [Originally posed by Oren] Given Linux C/R, how much work would it be to add the higher layers of DMTCP on top of Linux C/R? [ This is a non-trivial question. As just one example, DMTCP handles sockets uniformly, regardless of whether they are intra-host or inter-host. Linux C/R handles certain types of intra-host sockets. So, merging the two would require some thought. ] 4. [Originally posed by Tejun, e.g. Fri Nov 19 2010 - 09:04:42 EST] Given that DMTCP checkpoints many common applications, how much work would it be to add a small number of restricted kernel interfaces to enable one to remove some of the hacks in DMTCP, and to cover the more important corner cases that DMTCP might be missing? I'd also like to add some points of my own here. First, there are certain cases where I believe that a checkpoint-restart system (in-kernel or userland or hybrid) can never be completely transparent. It's because you can't completely cut the connection with the rest of the world. 
In these examples, I'm thinking primarily of the Linux C/R mode used to checkpoint a tree of processes. To the extent that Linux C/R is used with containers, it seems to me to be closer to lightweight virtualization. From there, I've seen that the conversation goes to comparing lightweight virtualization versus traditional virtual machines, but that discussion goes beyond my own personal expertise.

Here are some examples where I believe that every checkpointing system would suffer from the syndrome of trying to "checkpoint the world".

1. Time virtualization --- Right now, neither system does time virtualization. Both systems could do it. But what is the right policy? For example, one process may set a deadline for a task an hour in the future, and then periodically poll the kernel for the current time to see if one hour has passed. This use case seems to require time virtualization. A second process wants to know the current day and time, because a certain web service updates its information at midnight each day. This use case seems to argue that time virtualization is bad.

2. External database file on another host --- It's not possible to checkpoint the remote database file. In our work with the Condor developers, they asked us to add a "Condor mode", which says that if there are any external socket connections, then delay the checkpoint until the external socket connections are closed. In a different joint project with CERN (Geneva), we considered a checkpointing application in which an application saves much of the database, and then on restart, discovers how much of its data is stale, and re-loads only the stale portion.

3. NSCD (Name Service Cache Daemon) --- Glibc arranges for certain information to be cached in the NSCD. The information is in a memory segment shared between the NSCD and the application. Upon restart, the application doesn't know that the memory segment is no longer shared with the NSCD, or that the information is stale. The DMTCP "hack" is to zero out this memory page on restart. Then glibc recognizes that it needs to create a new shared memory segment.

4. screen --- The screen application sets the scrolling region of its ANSI terminal emulator, in order to create a status line at the bottom, while scrolling the remaining lines of the terminal. Upon restart, screen assumes that the scrolling region has already been set up, and doesn't have to be re-initialized. So, on restart, DMTCP uses SIGWINCH to fool screen (or any full-screen text-based application) into believing that its window size has been changed. So, screen (or vim, or emacs) then re-initializes the state of its ANSI terminal, including scrolling regions and so on.

So, a userland component is helpful in doing the kind of hacks above. I recognize that the Linux C/R team agrees that some userland component can be useful. I just want to show why some userland hacks will always be needed.

Let's consider a pure in-kernel approach to checkpointing 'screen' (or almost any full-screen application that uses a status bar at the bottom). Screen sets the scrolling region of an ANSI terminal emulator, which might be a gnome-terminal. So, a pure in-kernel approach needs to also checkpoint the gnome-terminal. But the gnome-terminal needs to talk to an X server. So, now one also needs to start up inside a VNC server to emulate the X server. So, either one adds a "hack" in userland to force screen to re-initialize its ANSI terminal emulator, or else one is forced to include an entire VNC server just to checkpoint a screen process.

Finally, this excerpt below from Tejun's post sums up our views too. We don't have the kernel expertise of the people on this list, but we've had to do a little bit of reading the kernel code where the documentation was sparse and in teaching O/S. We would certainly be very happy to work closely with the kernel developers, if there was interest in extending DMTCP to directly use more kernel support.
- Gene and Kapil Tejun Heo wrote Fri Nov 19 2010 - 09:04:42 EST > What's so wrong with Gene's work? Sure, it has some hacky aspects but > let's fix those up. To me, it sure looks like much saner and > manageable approach than in-kernel CR. We can add nested ptrace, > CLONE_SET_PID (or whatever) in pidns, integrate it with various ns > supports, add an ability to adjust brk, export inotify state via > fdinfo and so on. > > The thing is already working, the codebase of core part is fairly > small and condor is contemplating integrating it, so at least some > people in HPC segment think it's already viable. Maybe the HPC > cluster I'm currently sitting near is special case but people here > really don't run very fancy stuff. In most cases, they're fairly > simple (from system POV) C programs reading/writing data and burning a > _LOT_ of CPU cycles inbetween and admins here seem to think dmtcp > integrated with condor would work well enough for them. > > Sure, in-kernel CR has better or more reliable coverage now but by how > much? The basic things are already there in userland. ^ permalink raw reply [flat|nested] 49+ messages in thread
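The SIGWINCH trick described for screen in the message above can be sketched in a few lines. This is only an illustration of the mechanism, not DMTCP's restart code: the handler stands in for a full-screen program's redraw logic, and nudge_after_restart is a hypothetical name for the hook a restart tool might run.

```python
import os
import shutil
import signal

redraws = []  # sizes seen by the handler, one entry per re-initialization


def on_winch(signum, frame):
    # A full-screen program would re-query the size here and reprogram the
    # terminal (scrolling region, status line) from scratch.
    size = shutil.get_terminal_size()  # falls back to 80x24 without a tty
    redraws.append((size.columns, size.lines))


def nudge_after_restart(pid):
    # Hypothetical restart hook: pretend the window size changed.
    os.kill(pid, signal.SIGWINCH)


signal.signal(signal.SIGWINCH, on_winch)
nudge_after_restart(os.getpid())
```

The application never learns that it was restarted; it only sees an ordinary resize notification, which is why the approach works for screen, vim and emacs alike without modifying them.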
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-21 8:18 ` Gene Cooperman @ 2010-11-21 8:21 ` Gene Cooperman 2010-11-22 18:02 ` Sukadev Bhattiprolu 2010-11-23 17:53 ` Oren Laadan 2010-11-21 22:41 ` Grant Likely 2010-11-22 17:34 ` Oren Laadan 2 siblings, 2 replies; 49+ messages in thread From: Gene Cooperman @ 2010-11-21 8:21 UTC (permalink / raw) To: Gene Cooperman Cc: Tejun Heo, Oren Laadan, Kapil Arya, linux-kernel, xemul, Eric W. Biederman, Linux Containers

As Kapil and I wrote before, we benefited greatly from having talked with Oren, and learning some more about the context of the discussion. We were able to understand better the good technical points that Oren was making. Since the comparison table below concerns DMTCP, we'd like to state some additional technical points that could affect the conclusions.

> category        linux-cr                        userspace
> --------------------------------------------------------------------------------
> PERFORMANCE     has _zero_ runtime overhead     visible overhead due to syscalls
>                                                 interposition and state tracking
>                                                 even w/o checkpoints;

In our experiments so far, the overhead of system calls has been unmeasurable. We never wrap read() or write(), in order to keep overhead low. We also never wrap pthread synchronization primitives such as locks, for the same reason. The other system calls are used much less often, and so the overhead has been too small to measure in our experiments.

> OPTIMIZATIONS   many optimizations possible     limited, less effective
>                 only in kernel, for downtime,   w/ much larger overhead.
>                 image size, live-migration

As above, we believe that the overhead while running is negligible. I'm assuming that image size refers to in-kernel advantages for incremental checkpointing. This is useful for apps where the modified pages tend not to dominate. We agree with this point. As an orthogonal point, by default DMTCP compresses all checkpoint images using gzip on the fly.
This is useful even when most pages are modified between checkpoints. Still, as Oren writes, Linux C/R could also add a userland component to compress checkpoint images on the fly.

Next, live migration is a question that we simply haven't thought much about. If it's important, we could think about what userland approaches might exist, but we have no near-term plans to tackle live migration.

> OPERATION       applications run unmodified     to do c/r, needs 'controller'
>                                                 task (launch and manage _entire_
>                                                 execution) - point of failure.
>                                                 restricts how a system is used.

We'd like to clarify what may be some misconceptions. The DMTCP controller does not launch or manage any tasks. The DMTCP controller is stateless, and is only there to provide a barrier, namespace server, and single point of contact to relay ckpt/restart commands. Recall that the DMTCP controller handles processes across hosts --- not just on a single host.

Also, in any computation involving multiple processes, _every_ process of the computation is a point of failure. If any process of the computation dies, then the simple application strategy is to give up and revert to an earlier checkpoint. There are techniques by which an app or DMTCP can recreate certain failed processes. DMTCP doesn't currently recreate a dead controller (no demand for it), but it's not hard to do technically.

> PREEMPTIVE      checkpoint at any time, use     processes must be runnable and
>                 auxiliary task to save state;   "collaborate" for checkpoint;
>                 non-intrusive: failure does     long task coordination time
>                 not impact checkpointees.       with many tasks/threads. alters
>                                                 state of checkpointee if fails.
>                                                 e.g. cannot checkpoint when in
>                                                 vfork(), ptrace states, etc.

Our current support of vfork and ptrace has some of the issues that Oren points out. One example occurs if a process is in the kernel, and a ptrace state has changed.
If it was important for some application, we would either have to think of some "hack", or follow Tejun's alternative suggestion to work with the developers to add further kernel support. The kernel developers on this list can estimate the difficulties of kernel support better than I can. > COVERAGE save/restore _all_ task state; needs new ABI for everything: > identify shared resources; can expose state, provide means to > extend for new kernel features restore state (e.g. TCP protocol > easily options negotiated with peers) Currently, the only kernel support used by DMTCP is system calls (wrappers), /proc/*/fd, /proc/*/maps, /proc/*/cmdline, /proc/*/exe, /proc/*/stat. (I think I've named them all now.) The kernel developers will know better than us what other kernel state one might want to support for C/R, and what types of applications would need that. > RELIABILITY checkpoint w/ single syscall; non-atomic, cannot find leaks > atomic operation. guaranteed to determine restartability > restartability for containers My understanding is that the guarantees apply for Linux containers, but not for a tree of processes. Does this imply that linux-cr would have some of the same reliability issues as DMTCP for a tree of processes? (I mean the question sincerely, and am not intending to be rude.) In any case, won't DMTCP and Linux C/R have to handle orthogonal reliability issues such as external database, time virtualization, and other examples from our previous post? > USERSPACE GLUE possible possible > > SECURITY root and non-root modes root and non-root modes > native support for LSM > > MAINTENANCE changes mainly for features changes mainly for features; > create new ABI for features > iAnd by all means, I intend to cooperate with Gene to see how to > make the other part of DMTCP, namely the userspace "glue", work on > top of linux-cr to have the benefits of all worlds ! This is true, and we strongly welcome the cooperation. 
We don't know how this experiment will turn out, but the only way to find out is to sincerely try it. Whether we succeed or fail, we will learn something either way! - Gene and Kapil ^ permalink raw reply [flat|nested] 49+ messages in thread
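Of the /proc entries listed in the COVERAGE reply above, /proc/*/fd and /proc/*/maps are easy to show concretely. The following Linux-only sketch reads both for the current process; it is not taken from DMTCP, and the helper names open_fds and memory_regions are made up for this example.

```python
import os


def open_fds(pid="self"):
    """Map each open fd number to what it points at (file path, pipe, socket)."""
    table = {}
    for name in os.listdir(f"/proc/{pid}/fd"):
        try:
            table[int(name)] = os.readlink(f"/proc/{pid}/fd/{name}")
        except OSError:
            pass  # fd was closed between listdir() and readlink()
    return table


def memory_regions(pid="self"):
    """Return (start, end, perms, path) for every mapping in /proc/<pid>/maps."""
    regions = []
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            # Fields: address range, perms, offset, device, inode, optional path.
            fields = line.split(None, 5)
            start, end = (int(x, 16) for x in fields[0].split("-"))
            path = fields[5].strip() if len(fields) > 5 else ""
            regions.append((start, end, fields[1], path))
    return regions
```

Note what is missing: the link text for a pipe or socket gives only an inode number, not buffered data or negotiated protocol options, which is the kind of state the COVERAGE row says would need a new ABI.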
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-21 8:21 ` Gene Cooperman @ 2010-11-22 18:02 ` Sukadev Bhattiprolu 2010-11-23 17:53 ` Oren Laadan 1 sibling, 0 replies; 49+ messages in thread From: Sukadev Bhattiprolu @ 2010-11-22 18:02 UTC (permalink / raw) To: Gene Cooperman Cc: Kapil Arya, linux-kernel, xemul, Linux Containers, Eric W. Biederman, Tejun Heo

Gene Cooperman [gene@ccs.neu.edu] wrote:
| > RELIABILITY     checkpoint w/ single syscall;   non-atomic, cannot find leaks
| >                 atomic operation. guaranteed    to determine restartability
| >                 restartability for containers
|
| My understanding is that the guarantees apply for Linux containers, but not
| for a tree of processes. Does this imply that linux-cr would have some
| of the same reliability issues as DMTCP for a tree of processes? (I mean
| the question sincerely, and am not intending to be rude.) In any case,
| won't DMTCP and Linux C/R have to handle orthogonal reliability issues
| such as external database, time virtualization, and other examples
| from our previous post?

Yes, if the user attempts to checkpoint a partial container (what we refer to as a process subtree) or fails to snapshot/restore the filesystem, there could be leaks that we cannot detect. But one guarantee we are trying to provide is that if the user checkpoints a _complete_ container, then we will detect a leak if one exists.

Is there a way to establish a set of constraints (e.g. run the application in a container, snapshot/restore the filesystem) and then provide leak detection with a pure userspace implementation?

Sukadev
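To make the notion of a "leak" in the message above concrete, here is a toy model: a resource leaks if it is held both by a process inside the checkpointed set and by one outside it. This is an assumption-laden sketch, not linux-cr's algorithm; find_leaks, the resource labels and the pids are all invented for the example.

```python
def find_leaks(users, container):
    """Return resources shared between the container and outside processes.

    users:     resource id -> set of pids holding a reference to it
    container: set of pids being checkpointed together
    """
    return {
        resource: holders - container
        for resource, holders in users.items()
        if holders & container and holders - container
    }


# Hypothetical snapshot: pids 10 and 11 are in the container, pid 99 is not.
users = {
    "pipe:[101]": {10, 11},   # fully inside the container
    "pipe:[102]": {11, 99},   # shared with an outside process: a leak
    "socket:[7]": {99},       # entirely outside, irrelevant
}
leaks = find_leaks(users, {10, 11})
```

The hard part, and the point of Sukadev's question, is obtaining a complete and consistent `users` map: the model is only as good as the visibility into every holder of every resource.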
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-21 8:21 ` Gene Cooperman 2010-11-22 18:02 ` Sukadev Bhattiprolu @ 2010-11-23 17:53 ` Oren Laadan 2010-11-24 3:50 ` Kapil Arya 1 sibling, 1 reply; 49+ messages in thread From: Oren Laadan @ 2010-11-23 17:53 UTC (permalink / raw) To: Gene Cooperman Cc: Tejun Heo, Kapil Arya, linux-kernel, xemul, Eric W. Biederman, Linux Containers

On Sun, 21 Nov 2010, Gene Cooperman wrote:

> As Kapil and I wrote before, we benefited greatly from having talked with Oren,
> and learning some more about the context of the discussion. We were able
> to understand better the good technical points that Oren was making.
> Since the comparison table below concerns DMTCP, we'd like to
> state some additional technical points that could affect the conclusions.
>
> > category        linux-cr                        userspace
> > --------------------------------------------------------------------------------
> > PERFORMANCE     has _zero_ runtime overhead     visible overhead due to syscalls
> >                                                 interposition and state tracking
> >                                                 even w/o checkpoints;
>
> In our experiments so far, the overhead of system calls has been
> unmeasurable. We never wrap read() or write(), in order to keep overhead low.
> We also never wrap pthread synchronization primitives such as locks,
> for the same reason. The other system calls are used much less often, and so
> the overhead has been too small to measure in our experiments.

Syscall interception will have a visible effect on applications that use those syscalls. You may not observe overhead with HPC ones, but do you have numbers on server apps? Apps that use fork/clone and pipes extensively? Thread benchmarks, etc.? Compare that to the absolute zero overhead of linux-cr.

>
> > OPTIMIZATIONS   many optimizations possible     limited, less effective
> >                 only in kernel, for downtime,   w/ much larger overhead.
> >                 image size, live-migration
>
> As above, we believe that the overhead while running is negligible. I'm

For the HPC apps that you use.

> assuming that image size refers to in-kernel advantages for incremental
> checkpointing. This is useful for apps where the modified pages tend
> not to dominate. We agree with this point. As an orthogonal point,
> by default DMTCP compresses all checkpoint images using gzip on the fly.
> This is useful even when most pages are modified between checkpoints.
> Still, as Oren writes, Linux C/R could also add a userland component
> to compress checkpoint images on the fly.

This is not "userland component", it's "checkpoint | gzip > image.out"...

> Next, live migration is a question that we simply haven't thought much
> about. If it's important, we could think about what userland approaches might
> exist, but we have no near-term plans to tackle live migration.

As it is, live-migration _is_ a very important use case.

>
> > OPERATION       applications run unmodified     to do c/r, needs 'controller'
> >                                                 task (launch and manage _entire_
> >                                                 execution) - point of failure.
> >                                                 restricts how a system is used.
>
> We'd like to clarify what may be some misconceptions. The DMTCP
> controller does not launch or manage any tasks. The DMTCP controller
> is stateless, and is only there to provide a barrier, namespace server,
> and single point of contact to relay ckpt/restart commands. Recall that
> the DMTCP controller handles processes across hosts --- not just on a
> single host.

The controller is another point of failure. I already pointed out that the (controlled) application crashes when your controller dies, and you mentioned it's a bug that should be fixed. But then there will always be a risk for another, and another ... You also mentioned that if the controller dies, then the app should continue to run, but will not be checkpointable anymore (IIUC).

The point is that the controller is another point of failure, and makes the execution/checkpoint intrusive. It also adds security and user-management issues as you'll need one (or more?)
controller per user (right now, it's one for all, no?), and so on.

Plus, because the restarted apps get their virtualized IDs from the controller, they can't now "see" existing/new processes that may get the "same" pids (virtualization is not in the kernel).

> Also, in any computation involving multiple processes, _every_ process
> of the computation is a point of failure. If any process of the computation
> dies, then the simple application strategy is to give up and revert to an
> earlier checkpoint. There are techniques by which an app or DMTCP can
> recreate certain failed processes. DMTCP doesn't currently recreate
> a dead controller (no demand for it), but it's not hard to do technically.

The point is that you _add_ a point of failure: you make the "checkpoint" operation a possible reason for the application to crash. In contrast, in linux-cr the checkpoint is idempotent - harmless because it does not make the applications execute. Instead, it merely observes their state.

> > PREEMPTIVE      checkpoint at any time, use     processes must be runnable and
> >                 auxiliary task to save state;   "collaborate" for checkpoint;
> >                 non-intrusive: failure does     long task coordination time
> >                 not impact checkpointees.       with many tasks/threads. alters
> >                                                 state of checkpointee if fails.
> >                                                 e.g. cannot checkpoint when in
> >                                                 vfork(), ptrace states, etc.
>
> Our current support of vfork and ptrace has some of the issues that Oren points
> out. One example occurs if a process is in the kernel, and a ptrace state has
> changed. If it was important for some application, we would either have
> to think of some "hack", or follow Tejun's alternative suggestion to work
> with the developers to add further kernel support. The kernel developers
> on this list can estimate the difficulties of kernel support better than I can.
>
> > COVERAGE        save/restore _all_ task state;  needs new ABI for everything:
> >                 identify shared resources; can  expose state, provide means to
> >                 extend for new kernel features  restore state (e.g. TCP protocol
> >                 easily                          options negotiated with peers)
>
> Currently, the only kernel support used by DMTCP is system calls (wrappers),
> /proc/*/fd, /proc/*/maps, /proc/*/cmdline, /proc/*/exe, /proc/*/stat. (I think
> I've named them all now.) The kernel developers will know better
> than us what other kernel state one might want to support for C/R, and what
> types of applications would need that.
>
> > RELIABILITY     checkpoint w/ single syscall;   non-atomic, cannot find leaks
> >                 atomic operation. guaranteed    to determine restartability
> >                 restartability for containers
>
> My understanding is that the guarantees apply for Linux containers, but not
> for a tree of processes. Does this imply that linux-cr would have some
> of the same reliability issues as DMTCP for a tree of processes? (I mean
> the question sincerely, and am not intending to be rude.) In any case,
> won't DMTCP and Linux C/R have to handle orthogonal reliability issues
> such as external database, time virtualization, and other examples
> from our previous post?

There are two points in the claim above:

1) linux-cr can checkpoint with a single syscall - it's atomic. This gives you more guarantees about the consistency of the checkpointed application(s), and fewer "opportunities" for the operation as a whole to fail.

2) restartability - for full-container checkpoint only.

There is no "reliability" issue with c/r of non-containers - it's a matter of definition: it depends on what your requirements from the userspace application are and what sort of "glue" you have for it.

And I request again - let's leave out the questions of "time virtualization" and "external databases" - how are they different for the VM virtualization solution? They are completely orthogonal to the question we are debating.

Thanks,

Oren.
> > > USERSPACE GLUE possible possible > > > > SECURITY root and non-root modes root and non-root modes > > native support for LSM > > > > MAINTENANCE changes mainly for features changes mainly for features; > > create new ABI for features > > > iAnd by all means, I intend to cooperate with Gene to see how to > > make the other part of DMTCP, namely the userspace "glue", work on > > top of linux-cr to have the benefits of all worlds ! > > This is true, and we strongly welcome the cooperation. We don't know how > this experiment will turn out, but the only way to find out is to sincerely > try it. Whether we succeed or fail, we will learn something either way! > > - Gene and Kapil > > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-23 17:53 ` Oren Laadan @ 2010-11-24 3:50 ` Kapil Arya 2010-11-25 16:04 ` Oren Laadan 0 siblings, 1 reply; 49+ messages in thread From: Kapil Arya @ 2010-11-24 3:50 UTC (permalink / raw) To: Oren Laadan Cc: Gene Cooperman, Tejun Heo, linux-kernel, xemul, Eric W. Biederman, Linux Containers (Our first comment below actually replies to an earlier post by Oren. It seemed simpler to combine our comments.) > > d. screen and other full-screen text programs These are not the only > > examples of difficult interactions with the rest of the world. > > This actually never required a userspace "component" with Zap or linux-cr (to > the best of my knowledge). We would guess that Zap would not be able to support screen without a user space component. The bug occurs when screen is configured to have a status line at the bottom. We would be interested if you want to try it and let us know the results. ============================================= > > > category linux-cr > > > userspace > > > -------------------------------------------------------------------------------- > > > PERFORMANCE has _zero_ runtime overhead visible overhead due to > > > syscalls interposition and state tracking even w/o checkpoints; > > > > In our experiments so far, the overhead of system calls has been > > unmeasurable. We never wrap read() or write(), in order to keep overhead > > low. We also never wrap pthread synchronization primitives such as locks, > > for the same reason. The other system calls are used much less often, and > > so the overhead has been too small to measure in our experiments. > > Syscall interception will have visible effect on applications that use those > syscalls. You may not observe overheasd with HPC ones, but do you have > numbers on server apps ? apps that use fork/clone and pipes extensively ? > threads benchmarks et ? compare that to aboslute zero overhead of linux-cr. 
It's true that we haven't taken serious data on overhead with server apps. Is there a particular server app that you are thinking of as an example? I would expect fork/clone and pipes to be invoked infrequently in server apps, and so not to add measurably to CPU time. In most server apps such as MySQL, it is common to maintain a pool of threads for reuse rather than to repeatedly call clone for a new thread. This is done to ensure that the overhead of the clone calls is not significant. I would expect a similar policy for fork and pipes. <snip> > > > OPERATION applications run unmodified to do c/r, needs > > > 'controller' task (launch and manage _entire_ execution) - point of > > > failure. restricts how a system is used. > > > > We'd like to clarify what may be some misconceptions. The DMTCP controller > > does not launch or manage any tasks. The DMTCP controller is stateless, > > and is only there to provide a barrier, namespace server, and single point > > of contact to relay ckpt/restart commands. Recall that the DMTCP > > controller handles processes across hosts --- not just on a single host. > > The controller is another point of failure. I already pointed out that the > (controlled) application crashes when your controller dies, and you mentioned > it's a bug that should be fixed. But then there will always be a risk for > another, and another ... You also mentioned that if the controller dies, > then the app should continue to run, but will not be checkpointable anymore > (IIUC). > > The point is, that the controller is another point of failure, and makes the > execution/checkpoint intrusive. It also adds security and user-management > issues as you'll need one (or more ?) controller per user (right now, it's > one for all, no ?). and so on. Just to clarify, DMTCP uses one coordinator for each checkpointable computation. A single user may be running multiple computations with one coordinator for each computation.
We don't actually use the word controller in DMTCP terminology because the coordinator is stateless and so is coordinating but not controlling other processes. > Plus, because the restarted apps get their virtualized IDs from the > controller, then they can't now "see" existing/new processes that may get the > "same" pids (virtualization is not in the kernel). This appears to be a misconception. The wrappers within the user process maintain the pid-translation table for that process. The table maps between the original pid given by the kernel and the current pid assigned by the kernel on restart. This is handled locally and does not involve the coordinator. In the case of a fork there could be a pid-clash (the pid generated for a new process conflicts with someone else's original pid). However, DMTCP handles this by checking within the fork wrapper for a pid-clash. In the rare case of a pid-clash, the child process exits and the parent forks again. The same applies to clone and to any pid clash at restart time. > > Also, in any computation involving multiple processes, _every_ process > > of the computation is a point of failure. If any process of the > > computation dies, then the simple application strategy is to give up > > and revert to an earlier checkpoint. There are techniques by which an > > app or DMTCP can recreate certain failed processes. DMTCP doesn't > > currently recreate a dead controller (no demand for it), but it's not > > hard to do technically. > > The point is that you _add_ a point of failure: you make the "checkpoint" > operation a possible reason for the application to crash. In contrast, in > linux-cr the checkpoint is idempotent - harmless because it does not make > the applications execute. Instead, it merely observes their state. We were speaking above of the case when the process dies during a computation. We were not referring to checkpoint time. <snip> We would like to add our own comment/question.
To set the context we quote an earlier post: OL> Even if it did - the question is not how to deal with "glue" OL> (you demonstrated quite well how to do that with DMTCP), but OL> how should teh basic, core c/r functionality work - which is OL> below, and orthogonal to the "glue". There seems to be an implicit assumption that it is easy to separate the DMTCP "glue code" from the DMTCP C/R engine as separate modules. DMTCP is modular but it splits the problems into modules along a different line than Linux C/R. We look forward to the joint experiment in which we would try to combine DMTCP with Linux C/R. This will help answer the question in our mind. In order to explore the issue, let's imagine that we have a successful merge of DMTCP and Linux C/R. The following are some user-space glue issues. It's not obvious to us how the merged software will handle these issues. 1. Sockets -- DMTCP handles all sockets in a common manner through a single module. Sockets are checkpointed independently of whether they are local or remote. In a merger of DMTCP and Linux C/R, what does Linux C/R do when it sees remote sockets? Or should DMTCP take down all remote sockets before checkpointing? If DMTCP has to do this, it would be less efficient than the current design which keeps the remote sockets connections alive during checkpoint. 2. XLib and X11-server -- Consider checkpointing a single X11 app without the X11-server and without VNC. This is something we intend to add to DMTCP in the next few months. We have already mapped out the design in our minds. An X11 application includes the Xlib library. The data of an X11 window is, by default, contained in the X11 library -- not in the X11-server. The application communicates with the X11-server using socket connections, which would be considered a leak by Linux C/R. 
At restart time, DMTCP will ask the X11-server to create a bare window and then make the appropriate Xlib call to repaint the window based on the data stored in the Xlib library. For checkpoint/resume, the window stays up and does not have to be repainted. How will the combined DMTCP/Linux C/R work? Will DMTCP have to take down the window prior to Linux C/R and paint a new window at resume time? Doesn't this add inefficiency? 3. Checkpointing a single process (e.g. a bash shell) talking to an xterm via a pty -- We assume that from the viewpoint of Linux C/R a pty is a leak since there is a second process operating the master end of the pty. In this case we are guessing that Linux C/R would checkpoint and restart without the guarantees of reliability. We are guessing that Linux C/R would not save and restore the pty; instead it would be the responsibility of DMTCP to restore the current settings of the pty (e.g. packet mode vs. regular mode). Is our understanding correct? Would this work? Thanks, Gene and Kapil ^ permalink raw reply [flat|nested] 49+ messages in thread
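The pid-translation table and the fork-wrapper retry described in this message can be illustrated with a minimal sketch. This is not DMTCP code: the names `PidTable`, `fork_without_clash`, `spawn` and `reap` are hypothetical, the kernel is simulated by a callable that hands out pids, and the clash check only consults the local table, which is a simplification of whatever DMTCP actually does.

```python
class PidTable:
    """Per-process map between original pids and current kernel pids (hypothetical)."""

    def __init__(self):
        self._orig_to_cur = {}
        self._cur_to_orig = {}

    def bind(self, orig_pid, cur_pid):
        # Record that the process first known as orig_pid now runs as cur_pid.
        self._orig_to_cur[orig_pid] = cur_pid
        self._cur_to_orig[cur_pid] = orig_pid

    def current(self, orig_pid):
        return self._orig_to_cur[orig_pid]

    def original(self, cur_pid):
        return self._cur_to_orig[cur_pid]

    def clashes(self, new_pid):
        # A freshly forked pid becomes the child's "original" pid, so it must
        # not equal an original pid that a restarted process already owns.
        return new_pid in self._orig_to_cur


def fork_without_clash(spawn, reap, table, max_attempts=16):
    """Call spawn() until it yields a pid that does not clash; discard the rest."""
    for _ in range(max_attempts):
        pid = spawn()
        if not table.clashes(pid):
            table.bind(pid, pid)  # before any restart, original == current
            return pid
        reap(pid)  # stands in for "the child exits and the parent forks again"
    raise RuntimeError("could not obtain a non-clashing pid")


# Simulated run: a restarted process keeps original pid 4242 but now runs as 9001.
table = PidTable()
table.bind(orig_pid=4242, cur_pid=9001)
kernel_pids = iter([4242, 4243])  # the simulated kernel reuses 4242 first
discarded = []
child = fork_without_clash(lambda: next(kernel_pids), discarded.append, table)
```

In the simulated run the first pid offered (4242) is rejected because it is already somebody's original pid, and the second one (4243) is accepted.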
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-24 3:50 ` Kapil Arya @ 2010-11-25 16:04 ` Oren Laadan 2010-11-29 4:09 ` Gene Cooperman 0 siblings, 1 reply; 49+ messages in thread From: Oren Laadan @ 2010-11-25 16:04 UTC (permalink / raw) To: Kapil Arya Cc: Gene Cooperman, Tejun Heo, linux-kernel, xemul, Eric W. Biederman, Linux Containers [-- Attachment #1: Type: TEXT/PLAIN, Size: 4441 bytes --] On Tue, 23 Nov 2010, Kapil Arya wrote: > OL> Even if it did - the question is not how to deal with "glue" > OL> (you demonstrated quite well how to do that with DMTCP), but > OL> how should teh basic, core c/r functionality work - which is > OL> below, and orthogonal to the "glue". > > There seems to be an implicit assumption that it is easy to separate the DMTCP > "glue code" from the DMTCP C/R engine as separate modules. DMTCP is modular but > it splits the problems into modules along a different line than Linux C/R. We > look forward to the joint experiment in which we would try to combine DMTCP > with Linux C/R. This will help answer the question in our mind. I apologize for being blunt - but this is probably an issue specific to DMTCP's engineering... > In order to explore the issue, let's imagine that we have a successful merge of > DMTCP and Linux C/R. The following are some user-space glue issues. It's not > obvious to us how the merged software will handle these issues. > > 1. Sockets -- DMTCP handles all sockets in a common manner through a single > module. Sockets are checkpointed independently of whether they are local or > remote. In a merger of DMTCP and Linux C/R, what does Linux C/R do when it sees > remote sockets? Or should DMTCP take down all remote sockets before > checkpointing? If DMTCP has to do this, it would be less efficient than the > current design which keeps the remote sockets connections alive during > checkpoint. What is a "local" socket ? af_unix, or locally connected af_inet ? 
Anyway, with linux-cr you'd do what's needed after the restarted tasks are created, but before their state is restored. For each such "old" socket that you want to replace, you'd create (in userspace with arbitrary "glue" code!) a new socket, and use this socket when restoring the state of the task. Similarly, you could replace any other resource, not only sockets. > > 2. XLib and X11-server -- Consider checkpointing a single X11 app without the > X11-server and without VNC. This is something we intend to add to DMTCP in the > next few months. We have already mapped out the design in our minds. An X11 > application includes the Xlib library. The data of an X11 window is, by > default, contained in the X11 library -- not in the X11-server. The application > communicates with the X11-server using socket connections, which would be > considered a leak by Linux C/R. At restart time, DMTCP will ask the > X11-server to create a bare window and then make the appropriate Xlib call to > repaint the window based on the data stored in the Xlib library. > For checkpoint/resume, the window stays up and does not have to be repainted. > How will the combined DMTCP/Linux C/R work? Will DMTCP have to take > down the window prior to Linux C/R and paint a new window at resume time? > Doesn't this add inefficiency? Repainting during restart is the least of your problems. Leak detection is not a problem: If the socket connects out of the container (like af_inet) - then it is not a leak, and you treat it as described above. If the socket connects within the container but you don't checkpoint the "peer" process - then it is not a container-c/r (in which case you don't look for leaks). Also, the application could mark resources to not be checkpointed (e.g. scratch memory to save storage, or sockets to not count as leaks). I don't see any problem with X11 or any other library and "glue". > > 3. Checkpointing a single process (e.g.
a bash shell) talking to an xterm via > a pty -- We assume that from the viewpoint of Linux C/R a pty is a leak since > there is a second process operating the master end of the pty. In this > case we are > guessing that Linux C/R would checkpoint and restart without the gurantees of > reliability. We are guessing that Linux C/R would not save and restore the pty, > instead it would be the responsibility of DMTCP to restore the current settings > of the pty (e.g. packet mode vs. regular mode). Is our understanding correct? > Would this work? I explain again - in case it wasn't clear from my 3-part post: leak detection is relevant _only_ for full container-c/r. It doesn't make sense otherwise. If you want to checkpoint individual components of an application, then it's up to userspace to produce/provide the relevant "glue" to make it "make sense" when those components restart without their original eco-system. Thanks, Oren. ^ permalink raw reply [flat|nested] 49+ messages in thread
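As a rough illustration of what "leak detection" means for the pty example quoted in this message, the sketch below flags a resource when something outside the checkpointed set still holds a reference to it. It is an assumption-laden toy, not the linux-cr implementation: the function `find_leaks`, the task names, and the explicit reference counts are all hypothetical, and the rule "total references exceed references from checkpointed tasks" is only one plausible reading of the idea.

```python
from collections import Counter


def find_leaks(total_refs, refs_by_task, checkpointed_tasks):
    """Return resources referenced by someone outside the checkpointed set.

    total_refs:          resource -> number of references system-wide
    refs_by_task:        task -> list of resources it references
    checkpointed_tasks:  tasks included in the checkpoint
    """
    inside = Counter()
    for task in checkpointed_tasks:
        inside.update(refs_by_task.get(task, ()))
    # A resource "leaks" if not all of its references are accounted for
    # by the tasks being checkpointed.
    return sorted(res for res, n in inside.items() if total_refs[res] > n)


# bash and cat share a pipe; bash also talks to xterm through a pty.
total_refs = {"pipe": 2, "pty": 2, "file": 1}
refs_by_task = {
    "bash": ["pty", "pipe", "file"],
    "cat": ["pipe"],
    "xterm": ["pty"],
}

partial = find_leaks(total_refs, refs_by_task, ["bash", "cat"])
full = find_leaks(total_refs, refs_by_task, ["bash", "cat", "xterm"])
```

With xterm left out of the checkpoint, the pty is reported; with the whole group included, nothing is. That matches the point made above: the check is only meaningful when the intent is to checkpoint a complete, self-contained set of tasks.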
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-25 16:04 ` Oren Laadan @ 2010-11-29 4:09 ` Gene Cooperman 0 siblings, 0 replies; 49+ messages in thread From: Gene Cooperman @ 2010-11-29 4:09 UTC (permalink / raw) To: Oren Laadan Cc: Kapil Arya, Gene Cooperman, Tejun Heo, linux-kernel, xemul, Eric W. Biederman, Linux Containers Hi Oren, On Thu, Nov 25, 2010 at 11:04:16AM -0500, Oren Laadan wrote: > On Tue, 23 Nov 2010, Kapil Arya wrote: > > > OL> Even if it did - the question is not how to deal with "glue" > > OL> (you demonstrated quite well how to do that with DMTCP), but > > OL> how should teh basic, core c/r functionality work - which is > > OL> below, and orthogonal to the "glue". > > > > There seems to be an implicit assumption that it is easy to separate the DMTCP > > "glue code" from the DMTCP C/R engine as separate modules. DMTCP is modular but > > it splits the problems into modules along a different line than Linux C/R. We > > look forward to the joint experiment in which we would try to combine DMTCP > > with Linux C/R. This will help answer the question in our mind. > > I apologize for being blunt - but this is probably an issue specific to > DMTCP's engineering... > I completely agree with you, Oren. DMTCP was never designed to be split into a userland and in-kernel replacement. We will want to re-factor DMTCP to make this happen. I'm sorry if my e-mail came off as confrontational. That was not my intention. I was just looking forward to an interesting intellectual experiment --- how to go about combining DMTCP and Linux C/R. I was trying to guess ahead of time where there are interesting challenges, and my hope is that we will find a way to solve them together. Best wishes, - Gene ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-21 8:18 ` Gene Cooperman 2010-11-21 8:21 ` Gene Cooperman @ 2010-11-21 22:41 ` Grant Likely 2010-11-22 17:34 ` Oren Laadan 2 siblings, 0 replies; 49+ messages in thread From: Grant Likely @ 2010-11-21 22:41 UTC (permalink / raw) To: Gene Cooperman Cc: Tejun Heo, Oren Laadan, Kapil Arya, linux-kernel, xemul, Eric W. Biederman, Linux Containers On Sun, Nov 21, 2010 at 03:18:53AM -0500, Gene Cooperman wrote: > In this post, Kapil and I will provide our own summary of how we > see the issues for discussion so far. In the next post, we'll reply > specifically to comment on Oren's table of comparison between > linux-cr and userspace. > > In general, we'd like to add that the conversation with Oren was very > useful for us, and I think Oren will also agree that we were able to > converge on the purely technical questions. Hi Gene, Thanks for the good summary, it helps. Some random comments below... > > Concerning opinions, we want to be cautious on opinions, since we're > still learning the context of this ongoing discussion on LKML. There is > probably still some context that we're missing. > > Below, we'll summarize the four major questions that we've understood from > this discussion so far. But before doing so, I want to point out that a single > process or process tree will always have many possible interactions with > the rest of the world. Within our own group, we have an internal slogan: > "You can't checkpoint the world." > A virtual machine can have a relatively closed world, which makes it more > robust, but checkpointing will always have some fragile parts. > We give four examples below: > a. time virtualization > b. external database > c. NSCD daemon > d. screen and other full-screen text programs > These are not the only examples of difficult interactions with the > rest of the world. > > Anyway, in my opinion, the conversation with Oren seemed to converge > into two larger cases: > 1. 
In a pure userland C/R like DMTCP, how many corner cases are not handled, > or could not be handled, in a pure userland approach? > Also, how important are those corner cases? Do some > have important use cases that rise above just a corner case? > [ inotify is one of those examples. For DMTCP to support this, > it would have to put wrappers around inotify_add_watch, > inotify_rm_watch, read, etc., and maybe even tracking inodes in case > the file had been renamed after the inotify_add_watch. Something > could be made to work for the common cases, but it would > still be a hack --- to be done only if a use case demands it. ] > 2. In a Linux C/R approach, it's already recognized that one needs > a userland component (for example, for convenience of recreating > the process tree on restart). How many other cases are there > that require a userland component? > [ One example here is the shared memory segment of NSCD, which > has to be re-initialized on restart. Another example is > a screen process that talks to an ANSI terminal emulator > (e.g. gnome-terminal), which talks to an X server or VNC server. > Below, we discuss these examples in more detail. ] > > One can add a third and fourth question here: > > 3. [Originally posed by Oren] Given Linux C/R, how much work would > it be to add the higher layers of DMTCP on top of Linux C/R? > [ This is a non-trivial question. As just one example, DMTCP > handles sockets uniformly, regardless of whether they > are intra-host or inter-host. Linux C/R handles certain > types of intra-host sockets. So, merging the two would > require some thought. ] > 4. [Originally posed by Tejun, e.g. Fri Nov 19 2010 - 09:04:42 EST] > Given that DMTCP checkpoints many common applications, how much work > would it be to add a small number of restricted kernel interfaces > to enable one to remove some of the hacks in DMTCP, and to cover > the more important corner cases that DMTCP might be missing? 
> > > I'd also like to add some points of my own here. First, there are certain > cases where I believe that a checkpoint-restart system (in-kernel > or userland or hybrid) can never be completely transparent. It's because you > can't completely cut the connection with the rest of the world. In these > examples, I'm thinking primarily of the Linux C/R mode used to checkpoint > a tree of processes. > To the extent that Linux C/R is used with containers, it seems > to me to be closer to lightweight virtualization. From there, I've > seen that the conversation goes to comparing lightweight virtualization > versus traditional virtual machines, but that discussion goes beyond my > own personal expertise. At the risk of restating already applied arguments, and as a c/r outsider, this touches on the real crux of the issue for me. What is the complete set of boundaries between a c/r group of processes and the outside world? Is it bounded and is it understandable by mere kernel engineers? Does it change the assumptions about what a Linux process /is/, and how to handle it? How much? The broad strokes seem to be straight forward, but as already pointed out, the devil is in the details. > Here are some examples that I believe that every checkpointing system > would suffer from the syndrome of trying to "checkpoint the world". > > 1. Time virtualization --- Right now, neither system does time virtualization. > Both systems could do it. But what is the right policy? > For example, one process may set a deadline for a task an hour > in the future, and then periodically poll the kernel for the current time > to see if one hour has passed. This use case seems to require time > virtualization. > A second process wants to know the current day and time, because a certain > web service updates its information at midnight each day. This use case seems > seems to argue that time virtualization is bad. Temporal issues need to be (are being?) addressed regardless. 
In certain respects, I'm sure c/r can be seen as a *really long* scheduler latency, and would have the same effect as a system going into suspend, or a vm-level checkpoint. I would think the same behaviour would be desirable in all cases, include c/r. > 2. External database file on another host --- It's not possible to > checkpoint the remote database file. In our work with the Condor developers, > they asked us to add a "Condor mode", which says that if there are any > external socket connections, then delay the checkpoint until the external > socket connections are closed. In a different joint project with CERN (Geneva), > we considered a checkpointing application in which an application > saves much of the database, and then on restart, discovers how much > of its data is stale, and re-loads only the stale portion. > > 3. NSCD (Network Services Caching Daemon) --- Glibc arranges for > certain information to be cached in the NSCD. The information is > in a memory segment shared between the NSCD and the application. > Upon restart, the application doesn't know that the memory segment > is no longer shared with the NSCD, or that the information is stale. > The DMTCP "hack" is to zero out this memory page on restart. Then glibc > recognizes that it needs to create a new shared memory segment. Right here is exactly the example of a boundary that needs explicit rules. When a pair of processes have a shared region, and only one of them is checkpointed, then what is the behaviour on restore? In this specific example, a context-specific hack is used to achieve the desired result, but that doesn't work (as I believe you agree) in the general case. What behaviour will in-kernel support need to enforce? > 3. screen --- The screen application sets the scrolling region of > its ANSI terminal emulator, in order to create a status line > at the bottom, while scrolling the remaining lines of the terminal. 
> Upon restart, screen assumes that the scrolling region > has already been set up, and doesn't have to be re-initialized. > So, on restart, DMTCP uses SIGWINCH to fool screen (or any > full-screen text-based application) into believing that its > window size has been changed. So, screen (or vim, or emacs) > then re-initializes the state of its ANSI terminal, including > scrolling regions and so on. > So, a userland component is helpful in doing the kind of hacks above. > I recognize that the Linux C/R team agrees that some userland component > can be useful. I just want to show why some userland hacks will always be > needed. Let's consider a pure in-kernel approach to checkpointing 'screen' > (or almost any full-screen application that uses a status bar at the bottom). > Screen sets the scrolling region of an ANSI terminal emulator, > which might be a gnome-terminal. So, a pure in-kernel approach > needs to also checkpoint the gnome-terminal. But the gnome-terminal > needs to talk to an X server. So, now one also needs to start > up inside a VNC server to emulate the X server. So, either > one adds a "hack" in userland to force screen to re-initialize > its ANSI terminal emulator, or else one is forced to include > an entire VNC server just to checkpoint a screen process. ] > > Finally, this excerpt below from Tejun's post sums up our views too. We don't > have the kernel expertise of the people on this list, but we've had > to do a little bit of reading the kernel code where the documentation > was sparse and in teaching O/S. We would certainly be very happy to work > closely with the kernel developers, if there was interest in extending > DMTCP to directly use more kernel support. > > - Gene and Kapil > > Tejun Heo wrote Fri Nov 19 2010 - 09:04:42 EST > > What's so wrong with Gene's work? Sure, it has some hacky aspects but > > let's fix those up. To me, it sure looks like much saner and > > manageable approach than in-kernel CR. 
We can add nested ptrace, > > CLONE_SET_PID (or whatever) in pidns, integrate it with various ns > > supports, add an ability to adjust brk, export inotify state via > > fdinfo and so on. > > > > The thing is already working, the codebase of core part is fairly > > small and condor is contemplating integrating it, so at least some > > people in HPC segment think it's already viable. Maybe the HPC > > cluster I'm currently sitting near is special case but people here > > really don't run very fancy stuff. In most cases, they're fairly > > simple (from system POV) C programs reading/writing data and burning a > > _LOT_ of CPU cycles inbetween and admins here seem to think dmtcp > > integrated with condor would work well enough for them. > > > > Sure, in-kernel CR has better or more reliable coverage now but by how > > much? The basic things are already there in userland. > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 49+ messages in thread
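The two time-virtualization use cases quoted in this message (a deadline polled one hour ahead versus a job keyed to midnight) pull in opposite directions. A minimal sketch of the policy question, assuming a hypothetical `VirtualClock` with an injected time source; the thread states that neither system currently virtualizes time, so this only illustrates the trade-off and does not describe existing code:

```python
class VirtualClock:
    """Clock offering both a virtualized elapsed time and the real wall time."""

    def __init__(self, now):
        self._now = now        # injected time source returning seconds
        self._frozen = 0.0     # total seconds spent checkpointed
        self._frozen_at = None

    def checkpoint(self):
        # Remember when the process stopped running.
        self._frozen_at = self._now()

    def restart(self):
        # Hide the time that passed while the process did not exist.
        self._frozen += self._now() - self._frozen_at
        self._frozen_at = None

    def elapsed_time(self):
        # Suits "has an hour of my own running time passed?" deadlines.
        return self._now() - self._frozen

    def wall_time(self):
        # Suits "is it past midnight yet?" questions about the outside world.
        return self._now()


# Simulated timeline with a fake time source.
t = [1000.0]
clock = VirtualClock(lambda: t[0])
deadline = clock.elapsed_time() + 3600   # one hour of running time from now

t[0] += 600       # the process runs for ten minutes
clock.checkpoint()
t[0] += 86400     # the image sits on disk for a day
clock.restart()

deadline_passed = clock.elapsed_time() >= deadline
```

After the restart the virtualized clock has advanced only by the ten minutes of real execution, so the deadline has not passed, while `wall_time()` reflects the full day. Which of the two an application should see is exactly the policy question raised above.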
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-21 8:18 ` Gene Cooperman 2010-11-21 8:21 ` Gene Cooperman 2010-11-21 22:41 ` Grant Likely @ 2010-11-22 17:34 ` Oren Laadan 2 siblings, 0 replies; 49+ messages in thread From: Oren Laadan @ 2010-11-22 17:34 UTC (permalink / raw) To: Gene Cooperman Cc: Tejun Heo, Kapil Arya, linux-kernel, xemul, Eric W. Biederman, Linux Containers On Sun, 21 Nov 2010, Gene Cooperman wrote: > Below, we'll summarize the four major questions that we've understood from > this discussion so far. But before doing so, I want to point out that a single > process or process tree will always have many possible interactions with > the rest of the world. Within our own group, we have an internal slogan: > "You can't checkpoint the world." > A virtual machine can have a relatively closed world, which makes it more > robust, but checkpointing will always have some fragile parts. That depends on your definition of "world". One definition is "world := VM", as you state above. Another is "world := container" which I stated in my post(s). You can checkpoint both. For those cases where the "world" cannot be fully checkpointed, I explicitly pointed out that we should focus on the core c/r functionality, because the "glue" can be done either way. > We give four examples below: > a. time virtualization IMHO, irrelevant to current discussion. And btw, this is done in linux-cr for live migration of tcp connections. > b. external database > c. NSCD daemon This falls within the category of "glue", and is - as I try once again to remind - entirely orthogonal to the topic of where to do c/r. > d. screen and other full-screen text programs > These are not the only examples of difficult interactions with the > rest of the world. This actually never required a userspace "component" with Zap or linux-cr (to the best of my knowledge).
Even if it did - the question is not how to deal with "glue" (you demonstrated quite well how to do that with DMTCP), but how the basic, core c/r functionality should work - which is below, and orthogonal to the "glue". Let us please focus on the base c/r engine functionality... (gotta disconnect now .. more later) Oren. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-20 19:33 ` Tejun Heo 2010-11-21 8:18 ` Gene Cooperman @ 2010-11-22 17:18 ` Oren Laadan 1 sibling, 0 replies; 49+ messages in thread From: Oren Laadan @ 2010-11-22 17:18 UTC (permalink / raw) To: Tejun Heo Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul, Eric W. Biederman, Linux Containers On Sat, 20 Nov 2010, Tejun Heo wrote: > Hello, > > On 11/20/2010 07:15 PM, Oren Laadan wrote: > > > > [[apologies for the silly prefix on last two posts - a combination > > of windows, putty, pine and slow connection is not helping me :( ]] > > Maybe it's a good idea to post a clean concatenated version for later > reference? > Sure, as soon as I am back on a sane connection (~1 week) (I cut it in three to make it easier for people to digest ...) Oren. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch 2010-11-17 11:57 ` Tejun Heo 2010-11-17 15:39 ` Serge E. Hallyn @ 2010-11-17 22:17 ` Matt Helsley [not found] ` <20101117221713.GA27736-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org> 1 sibling, 1 reply; 49+ messages in thread From: Matt Helsley @ 2010-11-17 22:17 UTC (permalink / raw) To: Tejun Heo Cc: Oren Laadan, Gene Cooperman, Matt Helsley, Kapil Arya, ksummit-2010-discuss, linux-kernel, hch, Linux Containers On Wed, Nov 17, 2010 at 12:57:40PM +0100, Tejun Heo wrote: > Hello, Oren. > > On 11/07/2010 10:59 PM, Oren Laadan wrote: <snip> > > > Even if that happens (which is very unlikely and unnecessary), > > it will generate all the very same code in the kernel that Tejun > > has been complaining about, and _more_. And we will still suffer > > from issues such as lack of atomicity and being unable to do many > > simple and advanced optimizations. > > It may be harder but those will be localized for specific features > which would be useful for other purposes too. With in-kernel CR, > you're adding a bunch of intrusive changes which can't be tested or > used apart from CR. You seem to be arguing "Z is only testable/useful for doing the things Z was made for". I couldn't agree more with that. CR is useful for: Fault-tolerance (typical HPC) Load-balancing (less-typical HPC) Debugging (simple [e.g. instead of coredumps] or complex time-reversible) Embedded devices that need to deal with persistent low-memory situations. I think Oren's Kernel Summit presentation succinctly summarized these: http://www.cs.columbia.edu/~orenl/talks/ksummit-2010.pdf My personal favorite idea (that hasn't been implemented yet) is an application startup cache. I've been wondering if caching bash startup after all the shared libraries have been searched, loaded, and linked couldn't save a bunch of time spent in shell scripts. Post-link actually seems like a checkpoint in application startup which would be generally useful too. 
Of course you'd want to flush [portions of] the cache when packages get upgraded/removed or shell PATHs change and the caches would have to be per-user. I'm less confident but still curious about caching after running rc scripts (less confident because it would depend highly on the content of the rc scripts). A scripted boot, for example, might be able to save some time if the same rc scripts are run and they don't vary over time. That in turn might be useful for carefully-tuned boots on embedded devices. That said we don't currently have code for application caching. Yet we can't be expected to write tools for every possible use of our API in order to show just how true your tautology is. > > > Or we could use linux-cr for that: do the c/r in the kernel, > > keep the know-how in the kernel, expose (and commit to) a > > per-kernel-version ABI (not vow to keep countless new individual Oren, that statement might be read to imply that it's based on something as useless as kernel version numbers. Arnd has pointed out in the past how unsuitable that is and I tend to agree. There are at least two possible things we can relate it to: the SHA of the compiled kernel tree (which doesn't quite work because it assumes everybody uses git trees :( ), or perhaps the SHA/hash of the cpp-processed checkpoint_hdr.h. We could also stuff that header into the kernel (much like kconfigs are output from /proc) for programs that want the kernel to describe the ABI to them. > > ABIs forever after getting them wrongly...), be able to do all > > sorts of useful optimization and provide atomicity and guarantees > > (see under "leak detection" in the OLS linux-cr paper). Also, > > once the c/r infrastructure is in the kernel, it will be easy > > (and encouraged) to support newly introduced features.
> > And the only reason it seems easier is because you're working around > the ABI problem by declaring that these binary blobs wouldn't be kept > compatible between different kernel versions and configurations. That Not true. First of all, if you look at checkpoint_hdr.h, the contents and layout of the structs don't vary according to kernel configurations. Secondly, we have taken measures to increase the likelihood that the structures will remain compatible. We've designed them to layout the same on 32-bit and 64-bit variants of an arch. We add to the end of the structs. We use an explicit length field in a header to each section to ensure that changes in the size of the structures don't necessarily break compatibility. That said, yes, these measures don't absolutely preclude incompatibility. They will however make compatibility more likely. Then there's the fact that structures like siginfo (for example) so rarely change because they're already part of an ABI. That in turn means that the corresponding checkpoint ABI rarely changes (we don't reuse the existing struct because that would require compat-syscall-style code). Most of the time, in fact, the fields we output are there only because they reflect the 'model' of how things work that the kernel presents to userspace. That model also rarely changes (we've never gotten rid of the POSIX concept of process groups in one extreme example). Perhaps the closest thing we have to wholly-kernel-internal data structures are the signal/sighand structs which echo the way these fields are split from the task struct and shared between tasks. Though I'd argue that gets back into the 'model' presented to userspace (via fork/clone) anyway... I'd estimate that the biggest 'model' changes have come via various filesystem interfaces over the years. We don't checkpoint tasks with open sysfs, /proc, or debugfs files (for example) so that's not part of our ABI and we don't intend to make it so. 
Nor do we output wholly kernel-internal structures and fields that are often chosen for their performance benefits (e.g. rbtrees, linked lists, hash tables, idrs, various caches, locks, RCU heads, refcounts, etc). So the kernel is free to change implementations without affecting our ABI.

The compatibility has natural limits. For instance we can't ever restart an x86_64 binary on a 32-bit kernel. If you add a new syscall interface (e.g. fanotify) then you can't use a checkpoint of a task that makes use of it on fanotify-disabled kernels. Yet these limitations exist no matter where or how you choose to implement checkpoint/restart.

We've made almost every effort at making this a proper ABI (I say 'almost' because we still need to export a description of it at runtime and we need to do something better in place of the logfd output). Still, the essentials of a proper checkpoint/restart ABI are already there.

Cheers,
    -Matt Helsley
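The compatibility measures listed in this message (fixed-width fields so the layout matches on 32-bit and 64-bit, new fields appended at the end, an explicit length in each section header) can be illustrated with a small reader. This is only a sketch under stated assumptions: sec_hdr, task_rec_v1, task_rec_v2 and read_record are hypothetical names invented for the example and are not taken from checkpoint_hdr.h or any linux-cr code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical section header (NOT the real checkpoint_hdr.h layout). */
struct sec_hdr {
    uint32_t type;  /* kind of record that follows */
    uint32_t len;   /* total record length in bytes, header included */
};

/* Hypothetical record, first revision. Fixed-width fields only, so the
 * layout is identical on 32-bit and 64-bit builds. */
struct task_rec_v1 {
    struct sec_hdr h;
    uint32_t pid;
    uint32_t pgid;
};

/* Later revision: new fields are appended, existing ones never move. */
struct task_rec_v2 {
    struct sec_hdr h;
    uint32_t pid;
    uint32_t pgid;
    uint32_t sid;
    uint32_t pad;
};

/* Copy one record from buf into a caller struct of size want.
 * A longer (newer) record is truncated to what the reader understands;
 * a shorter (older) record leaves the unknown tail zeroed.
 * Returns the on-disk record length to skip, or 0 if the input is bad. */
size_t read_record(const unsigned char *buf, size_t avail,
                   void *out, size_t want)
{
    struct sec_hdr h;

    assert(buf != NULL && out != NULL);
    if (avail < sizeof h)
        return 0;
    memcpy(&h, buf, sizeof h);
    if (h.len < sizeof h || h.len > avail)
        return 0;

    memset(out, 0, want);
    memcpy(out, buf, h.len < want ? h.len : want);
    return h.len;
}
```

With this shape, an old reader given a task_rec_v2 still finds pid and pgid where it expects them and uses the returned length to skip the fields it does not know, while a new reader given a task_rec_v1 sees sid as zero. As the message says, this makes compatibility more likely rather than guaranteed.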
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
From: Tejun Heo @ 2010-11-18 10:06 UTC
To: Matt Helsley
Cc: Kapil Arya, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA, ksummit-2010-discuss-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Linux Containers, hch-jcswGhMUV9g

Hello, Matt.

On 11/17/2010 11:17 PM, Matt Helsley wrote:
>> It may be harder but those will be localized for specific features
>> which would be useful for other purposes too. With in-kernel CR,
>> you're adding a bunch of intrusive changes which can't be tested or
>> used apart from CR.
>
> You seem to be arguing "Z is only testable/useful for doing the things Z
> was made for". I couldn't agree more with that. CR is useful for:

I'm saying it's way too narrowly scoped and inflexible to be a kernel feature. Kernel features should be like the basic tools, you know, hammers, saws, drills and stuff. In-kernel CR is more like an overcomplicated food processor which usually sits in the top drawer after the first several runs,

> Fault-tolerance (typical HPC)
> Load-balancing (less-typical HPC)
> Debugging (simple [e.g. instead of coredumps] or complex
> time-reversible)
> Embedded devices that need to deal with persistent low-memory
> situations.

which can do all of the above, a lot of which can be achieved in a less messy way than putting the whole thing inside the kernel.

> My personal favorite idea (that hasn't been implemented yet) is an
> application startup cache. I've been wondering if caching bash startup
> after all the shared libraries have been searched, loaded, and linked
> couldn't save a bunch of time spent in shell scripts. Post-link actually
> seems like a checkpoint in application startup which would be generally
> useful too. Of course you'd want to flush [portions of] the cache when
> packages get upgraded/removed or shell PATHs change and the caches
> would have to be per-user.

What does that have to do with the kernel? If you want a post-link cache, implement it in ld.so where it belongs. That's like using a food processor to mix cement.

> I'm less confident but still curious about caching after running rc
> scripts (less confident because it would depend highly on the content
> of the rc scripts). A scripted boot, for example, might be able to save
> some time if the same rc scripts are run and they don't vary over time.
> That in turn might be useful for carefully-tuned boots on embedded devices.
>
> That said we don't currently have code for application caching. Yet we
> can't be expected to write tools for every possible use of our API in
> order to show just how true your tautology is.

Continuing the same line of thought. It _CAN_ be used to do that in a convoluted way but there are better ways to solve those problems.

> Most of the time, in fact, the fields we output are there only because
> they reflect the 'model' of how things work that the kernel presents to
> userspace. That model also rarely changes (we've never gotten rid of the
> POSIX concept of process groups in one extreme example). Perhaps the
> closest thing we have to wholly-kernel-internal data structures are the
> signal/sighand structs which echo the way these fields are split from the
> task struct and shared between tasks. Though I'd argue that gets back into
> the 'model' presented to userspace (via fork/clone) anyway...

Yeah, exactly, so just do it inside the established ABI, extending it where it makes sense. No reason to add a whole separate set.

Thanks.

--
tejun
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
From: Oren Laadan @ 2010-11-18 20:25 UTC
To: Matt Helsley
Cc: Kapil Arya, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA, ksummit-2010-discuss-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Tejun Heo, Linux Containers, hch-jcswGhMUV9g

On 11/17/2010 05:17 PM, Matt Helsley wrote:
> On Wed, Nov 17, 2010 at 12:57:40PM +0100, Tejun Heo wrote:
>> Hello, Oren.
>>
>> On 11/07/2010 10:59 PM, Oren Laadan wrote:

<snip>

>>> Or we could use linux-cr for that: do the c/r in the kernel,
>>> keep the know-how in the kernel, expose (and commit to) a
>>> per-kernel-version ABI (not vow to keep countless new individual
>
> Oren, that statement might be read to imply that it's based on
> something as useless as kernel version numbers. Arnd has pointed out in the
> past how unsuitable that is and I tend to agree. There are at least two
> possible things we can relate it to: the SHA of the compiled kernel tree
> (which doesn't quite work because it assumes everybody uses git trees :( ),
> or perhaps the SHA/hash of the cpp-processed checkpoint_hdr.h. We could
> also stuff that header into the kernel (much like kconfigs are output from
> /proc) for programs that want the kernel to describe the ABI to them.

BTW, it's the same for userspace c/r: for the same set of features, the format (ABI) remains unchanged. Adding features breaks this and a new version is necessary, and conversion from old to new will be needed.

Moreover, supporting a new feature in userspace means adding the proper API/ABI in the kernel, including refactoring etc, which is even harder than adding the support for it in linux-cr.

Oren.
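The idea quoted above, tying the checkpoint ABI to a hash of the cpp-processed checkpoint_hdr.h rather than to a kernel version number, might look roughly like the sketch below. Everything here is an assumption made for illustration: FNV-1a stands in for a real SHA only to keep the example free of library dependencies, and abi_id and abi_compatible are made-up names, not functions that exist in linux-cr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over the preprocessed header text.
 * Stand-in for a cryptographic hash such as SHA-1. */
uint64_t abi_id(const char *text, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */

    assert(text != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)text[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* Hypothetical restart-time check: the id recorded in the checkpoint
 * image must match the id of the header the running kernel was built
 * with. Returns nonzero when they match. */
int abi_compatible(uint64_t image_id, uint64_t kernel_id)
{
    return image_id == kernel_id;
}
```

A checkpoint tool would store abi_id() of the header text in the image, and restart would compare it with the value exported by the running kernel. Two kernels with different version numbers but an identical header then count as the same ABI, which is the property a version-number scheme lacks. An exact-match check like this is stricter than the append-only compatibility Matt describes, so it would complement the format versioning and conversion Oren mentions, not replace it.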
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
From: Oren Laadan @ 2010-11-08 18:14 UTC
To: Gene Cooperman
Cc: Kapil Arya, Tejun Heo, ksummit-2010-discuss, linux-kernel, hch, Linux Containers

Hi,

Ok, I'll bite the bullet for now - to be continued...

Just one important clarification:

>> Linux-cr can do live migration - e.g. VDI, move the desktop - in
>> which case skype's sockets' network stacks are reconstructed,
>> transparently to both skype (local apps) and the peer (remote apps).
>> Then, at the destination host and skype continues to work.
>
> That's a really cool thing to do, and it's definitely not part of what
> DMTCP does. It might be possible to do userland live migration,
> but it's definitely not part of our current scope. But if we're talking
> about live migration, have you also looked at the work of
> Andres Lagar Cavilla on SnowFlock?
> http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf
> He does live migration of entire virtual machines, again with very
> small delay. Of course, the issue for any type of live migration is that
> if the rate of dirtying pages is very high (e.g. HPC), then there is
> still a delay or slow response, due to page faults to a remote host.

VMware, Xen and KVM already do live migration. However, VMs are a separate beast.

We are concerned about _application_ level c/r and migration (complete containers or individual applications). Many proven techniques from the VM world apply to our context too (in your example, post-copy migration).

Oren.
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
From: Gene Cooperman @ 2010-11-08 18:37 UTC
To: Oren Laadan
Cc: Gene Cooperman, Kapil Arya, Tejun Heo, ksummit-2010-discuss, linux-kernel, hch, Linux Containers

Thanks for the careful response, Oren. For others who read this, one could interpret Oren's rapid post as criticizing the work of Andres Lagar Cavilla. I'm sure that this was not Oren's intention. Please read below for a brief clarification of the novelty of SnowFlock.

Anyway, I really look forward to the phone discussion. I've also enjoyed our interchange, for giving me an opportunity to explain more about the DMTCP design. Thank you.

Best wishes,
- Gene

On Mon, Nov 08, 2010 at 01:14:12PM -0500, Oren Laadan wrote:
> Hi,
>
> Ok, I'll bite the bullet for now - to be continued...
>
> Just one important clarification:
>
> >>Linux-cr can do live migration - e.g. VDI, move the desktop - in
> >>which case skype's sockets' network stacks are reconstructed,
> >>transparently to both skype (local apps) and the peer (remote apps).
> >>Then, at the destination host and skype continues to work.
> >
> >That's a really cool thing to do, and it's definitely not part of what
> >DMTCP does. It might be possible to do userland live migration,
> >but it's definitely not part of our current scope. But if we're talking
> >about live migration, have you also looked at the work of
> >Andres Lagar Cavilla on SnowFlock?
> > http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf
> >He does live migration of entire virtual machines, again with very
> >small delay. Of course, the issue for any type of live migration is that
> >if the rate of dirtying pages is very high (e.g. HPC), then there is
> >still a delay or slow response, due to page faults to a remote host.
>
> VMware, Xen and KVM already do live migration. However, VMs
> are a separate beast.

I absolutely agree with your point that live migration of applications is a different beast, and technically very novel. Since I know Andres Lagar Cavilla personally, I also feel obligated to comment why SnowFlock truly is novel in the VM space. First, as Andres writes:

    "SnowFlock is an open-source project [SnowFlock] built on the Xen 3.0.3 VMM [Barham 2003]."

In the abstract, Andres points out one of the major points of novelty:

    "To evaluate SnowFlock, we focus on the demanding scenario of services requiring on-the-fly creation of hundreds of parallel workers in order to solve computationally intensive queries in seconds."

We must be careful that we don't destroy someone's reputation without a careful study of their work.

> We are concerned about _application_ level c/r and migration
> (complete containers or individual applications). Many proven
> techniques from the VM world apply to our context too (in your
> example, post-copy migration).
>
> Oren.
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
From: Oren Laadan @ 2010-11-08 19:34 UTC
To: Gene Cooperman
Cc: Kapil Arya, Tejun Heo, linux-kernel, hch, Linux Containers

On 11/08/2010 01:37 PM, Gene Cooperman wrote:
> Thanks for the careful response, Oren. For others who read this,
> one could interpret Oren's rapid post as criticizing the work of
> Andres Lagar Cavilla. I'm sure that this was not Oren's intention.
> Please read below for a brief clarification of the novelty of SnowFlock.

Err... yes, that was careless of me. I was too focused on getting the thread back on track. Thanks for pointing that out.

>>> about live migration, have you also looked at the work of
>>> Andres Lagar Cavilla on SnowFlock?
>>> http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf
>>> He does live migration of entire virtual machines, again with very
>>> small delay. Of course, the issue for any type of live migration is that
>>> if the rate of dirtying pages is very high (e.g. HPC), then there is
>>> still a delay or slow response, due to page faults to a remote host.
>>
>> VMware, Xen and KVM already do live migration. However, VMs
>> are a separate beast.
>
> I absolutely agree with your point that live migration of
> applications is a different beast, and technically very novel.
> Since I know Andres Lagar Cavilla personally, I also feel obligated
> to comment why SnowFlock truly is novel in the VM space. First, as Andres
> writes:
> "SnowFlock is an open-source project [SnowFlock] built on the Xen 3.0.3
> VMM [Barham 2003]."
> In the abstract, Andres points out one of the major points of novelty:
> "To evaluate SnowFlock, we focus on the demanding
> scenario of services requiring on-the-fly creation of hundreds
> of parallel workers in order to solve computationally intensive
> queries in seconds."
> We must be careful that we don't destroy someone's reputation without
> a careful study of their work.

Yes, it's really nice work - I saw it when I visited there. (Coincidentally the post-copy idea with Xen appeared also in VEE 09 briefly before).

Oren.
end of thread, other threads:[~2010-11-29 4:09 UTC | newest]
[not found] <Pine.LNX.4.64.1011021530470.12128@takamine.ncl.cs.columbia.edu>
[not found] ` <4CD08419.5050803@kernel.org>
[not found] ` <AANLkTinOg6n3ZA+0gHzw9LouRuUmJ7DJwHtABRy5c=gM@mail.gmail.com>
[not found] ` <4CD26948.7050009@kernel.org>
[not found] ` <20101104164401.GC10656@sundance.ccs.neu.edu>
[not found] ` <4CD3CE29.2010105@kernel.org>
[not found] ` <20101106053204.GB12449@count0.beaverton.ibm.com>
[not found] ` <20101106204008.GA31077@sundance.ccs.neu.edu>
2010-11-07 21:44 ` [Ksummit-2010-discuss] checkpoint-restart: naked patch Oren Laadan
2010-11-07 23:31 ` Gene Cooperman
[not found] ` <4CD5D99A.8000402@cs.columbia.edu>
[not found] ` <20101107184927.GF31077@sundance.ccs.neu.edu>
[not found] ` <20101107184927.GF31077-Rl5vdzG4YPwx/1z6v04GWfZ8FUJU4vz8@public.gmane.org>
2010-11-07 21:59 ` Oren Laadan
2010-11-17 11:57 ` Tejun Heo
2010-11-17 15:39 ` Serge E. Hallyn
2010-11-17 15:46 ` Tejun Heo
2010-11-18 9:13 ` Pavel Emelyanov
[not found] ` <4CE4EE21.6050305-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-11-18 9:48 ` Tejun Heo
2010-11-18 20:13 ` Jose R. Santos
2010-11-19 3:54 ` Serge Hallyn
2010-11-18 19:53 ` Oren Laadan
2010-11-19 4:10 ` Serge Hallyn
2010-11-19 14:04 ` Tejun Heo
2010-11-20 18:05 ` Oren Laadan
[not found] ` <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2010-11-19 14:36 ` Kirill Korotaev
[not found] ` <04F4899E-B5C7-4BAF-8F2F-05D507A91408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-11-19 15:33 ` Tejun Heo
2010-11-19 16:00 ` Alexey Dobriyan
2010-11-19 16:01 ` Alexey Dobriyan
2010-11-19 16:10 ` Tejun Heo
2010-11-19 16:25 ` Alexey Dobriyan
2010-11-19 16:06 ` Tejun Heo
2010-11-19 16:16 ` Alexey Dobriyan
2010-11-19 16:19 ` Tejun Heo
2010-11-19 16:27 ` Alexey Dobriyan
[not found] ` <AANLkTin7kd3crS+fTLLea5PhAii7B3dz=n7p7YtQ6d4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-11-19 16:32 ` Tejun Heo
2010-11-19 16:38 ` Alexey Dobriyan
2010-11-19 16:50 ` Tejun Heo
2010-11-19 16:55 ` Alexey Dobriyan
2010-11-20 17:58 ` Oren Laadan
2010-11-20 18:08 ` Oren Laadan
2010-11-20 18:11 ` Oren Laadan
[not found] ` <4CE69B8C.6050606-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-11-20 18:15 ` Oren Laadan
2010-11-20 19:33 ` Tejun Heo
2010-11-21 8:18 ` Gene Cooperman
2010-11-21 8:21 ` Gene Cooperman
2010-11-22 18:02 ` Sukadev Bhattiprolu
2010-11-23 17:53 ` Oren Laadan
2010-11-24 3:50 ` Kapil Arya
2010-11-25 16:04 ` Oren Laadan
2010-11-29 4:09 ` Gene Cooperman
2010-11-21 22:41 ` Grant Likely
2010-11-22 17:34 ` Oren Laadan
2010-11-22 17:18 ` Oren Laadan
2010-11-17 22:17 ` Matt Helsley
[not found] ` <20101117221713.GA27736-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-11-18 10:06 ` Tejun Heo
2010-11-18 20:25 ` Oren Laadan
[not found] ` <AANLkTimDXKsBCxbsEOfgkYV2R8FK=bhFdmx9UQow5hqp@mail.gmail.com>
[not found] ` <4CD5DCE3.3000109@cs.columbia.edu>
[not found] ` <20101107194222.GG31077@sundance.ccs.neu.edu>
[not found] ` <4CD71A6B.3020905@cs.columbia.edu>
[not found] ` <20101107230516.GJ31077@sundance.ccs.neu.edu>
[not found] ` <4CD774CA.8030004@cs.columbia.edu>
[not found] ` <20101108162630.GN31077@sundance.ccs.neu.edu>
2010-11-08 18:14 ` Oren Laadan
2010-11-08 18:37 ` Gene Cooperman
2010-11-08 19:34 ` Oren Laadan