Linux Container Development
* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found]             ` <20101106204008.GA31077@sundance.ccs.neu.edu>
@ 2010-11-07 21:44               ` Oren Laadan
  2010-11-07 23:31                 ` Gene Cooperman
       [not found]               ` <4CD5D99A.8000402@cs.columbia.edu>
  1 sibling, 1 reply; 49+ messages in thread
From: Oren Laadan @ 2010-11-07 21:44 UTC (permalink / raw)
  To: Gene Cooperman
  Cc: Matt Helsley, Tejun Heo, Kapil Arya, ksummit-2010-discuss,
	linux-kernel, hch, Linux Containers

[cc'ing linux containers mailing list]

On 11/06/2010 04:40 PM, Gene Cooperman wrote:

> 8.  What happens if the DMTCP coordinator (checkpoint control process) dies?
>    [ The same thing that happens if a user process dies.  We kill the whole
>      computation, and restart.  At restart, we use a new coordinator.
>      Coordinators are stateless. ]

My experience is different:

I downloaded dmtcp and followed the quick-start guide:
(1) "dmtcp_coordinator" on one terminal
(2) "dmtcp_checkpoint bash" on another terminal

Then I:
(3) pkill -9 dmtcp_coordinator
... oops - 'bash' died.

I didn't even try to take a checkpoint :(

Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found]                   ` <20101107184927.GF31077-Rl5vdzG4YPwx/1z6v04GWfZ8FUJU4vz8@public.gmane.org>
@ 2010-11-07 21:59                     ` Oren Laadan
  2010-11-17 11:57                       ` Tejun Heo
  0 siblings, 1 reply; 49+ messages in thread
From: Oren Laadan @ 2010-11-07 21:59 UTC (permalink / raw)
  To: Gene Cooperman
  Cc: Kapil Arya, linux-kernel, ksummit-2010-discuss, Tejun Heo,
	Linux Containers, hch

[cc'ing linux containers mailing list]

On 11/07/2010 01:49 PM, Gene Cooperman wrote:

[snip]

> Matt had asked how we would handle inotify(), but I was getting swamped
> by all the questions.  There is a virtualization approach to inotify in which
> one puts wrappers around inotify_add_watch(), inotify_rm_watch() and
> friends in the same way as we wrap open() and could wrap close().
> One would then need to wrap read() (which we don't like to do, just

This sounds like reimplementing in userspace the very same logic
done by the kernel :)

> in case it could add significant overhead).  But if we consider kernel
> and userland virtualization together, then something similar to  TIOCSTI
> for ioctl would allow us to avoid wrapping read().
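
To illustrate the wrapper idea quoted above, here is a minimal sketch of
the bookkeeping such wrappers would need, written in Python rather than C
for brevity. The names WatchTable, add_watch, rm_watch, replay and real_wd
are hypothetical and are not taken from DMTCP; the backend_add callable is
an assumed stand-in for the real inotify_add_watch().

```python
class WatchTable:
    """Records inotify watches so they can be re-created after restart."""

    def __init__(self, backend_add):
        # backend_add(path, mask) -> real watch descriptor.  This is a
        # hypothetical stand-in for inotify_add_watch() in this sketch.
        self._backend_add = backend_add
        self._next_virtual = 1
        self._watches = {}   # virtual wd -> (path, mask)
        self._real = {}      # virtual wd -> real wd

    def add_watch(self, path, mask):
        real_wd = self._backend_add(path, mask)
        virt = self._next_virtual
        self._next_virtual += 1
        self._watches[virt] = (path, mask)
        self._real[virt] = real_wd
        return virt          # the application only ever sees this value

    def rm_watch(self, virt):
        self._watches.pop(virt)
        return self._real.pop(virt)

    def replay(self, backend_add):
        """After restart: re-create each watch, keeping virtual wds stable."""
        self._backend_add = backend_add
        for virt, (path, mask) in self._watches.items():
            self._real[virt] = backend_add(path, mask)

    def real_wd(self, virt):
        return self._real[virt]
```

Events returned by read() on an inotify descriptor carry the real watch
descriptor, so a table like this is only half of the job: the values would
still have to be translated back to the virtual ones, which is why the
quoted text says read() would need a wrapper too.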

We could work to add ABIs and APIs for each and every possible piece
of state that affects userspace. And for each we'll argue forever
about the design and some time later regret that it wasn't designed
correctly :p

Even if that happens (which is very unlikely and unnecessary),
it will generate all the very same code in the kernel that Tejun
has been complaining about, and _more_. And we will still suffer
from issues such as lack of atomicity and being unable to do many
simple and advanced optimizations.

Or we could use linux-cr for that: do the c/r in the kernel,
keep the know-how in the kernel, expose (and commit to) a
per-kernel-version ABI (not vow to keep countless new individual
ABIs forever after getting them wrong...), be able to do all
sorts of useful optimization and provide atomicity and guarantees
(see under "leak detection" in the OLS linux-cr paper). Also,
once the c/r infrastructure is in the kernel, it will be easy
(and encouraged) to support newly introduced features.

Finally, then we would use dmtcp as well as other tools on top
of the kernel-cr - and I'm looking forward to doing that!

[snip]

>> Hmm... can you really c/r from userspace a process that was, at
>> checkpoint time, in a ptrace-stopped state at an arbitrary kernel
>> ptrace-hook ?  I strongly suspect the answer is "no", definitely
>> not unless you also virtualize and replicate the entire in-kernel
>> ptrace functionality in userspace,
>
> Let's try it and see.  If you write a program, we'll try it out in
> DMTCP (unstable branch) and see.  So far, checkpointing gdb sessions
> has worked well for us.  If there is something we don't cover, it will
> be helpful to both of us to find it, and analyze that case.

Try "strace bash" :)
I suspect it won't work - and for the reasons I described.

[snip]

>> (Now looking forward to discussing more details with the dmtcp team
>> on Tuesday and onward :)
>
> Also a very good point above, and I agree.  The offline discussion should
> be a better forum for putting this all into perspective.
>
> Thanks again for your thoughtful response,

Same here. Talk to you soon...

Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-07 21:44               ` [Ksummit-2010-discuss] checkpoint-restart: naked patch Oren Laadan
@ 2010-11-07 23:31                 ` Gene Cooperman
  0 siblings, 0 replies; 49+ messages in thread
From: Gene Cooperman @ 2010-11-07 23:31 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Gene Cooperman, Matt Helsley, Tejun Heo, Kapil Arya,
	ksummit-2010-discuss, linux-kernel, hch, Linux Containers

On Sun, Nov 07, 2010 at 04:44:20PM -0500, Oren Laadan wrote:
> [cc'ing linux containers mailing list]
> 
> On 11/06/2010 04:40 PM, Gene Cooperman wrote:
> 
> >8.  What happens if the DMTCP coordinator (checkpoint control process) dies?
> >   [ The same thing that happens if a user process dies.  We kill the whole
> >     computation, and restart.  At restart, we use a new coordinator.
> >     Coordinators are stateless. ]
> 
> My experience is different:
> 
> I downloaded dmtcp and followed the quick-start guide:
> (1) "dmtcp_coordinator" on one terminal
> (2) "dmtcp_checkpoint bash" on another terminal
> 
> Then I:
> (3) pkill -9 dmtcp_coordinator
> ... oops - 'bash' died.
> 
> I didn't even try to take a checkpoint :(

You're right.  I just reproduced your example.  But please remember that
we're working in a design space where if any process of a computation
dies, then we kill the computation and restart.  It doesn't matter to us
if it's a user process or the DMTCP coordinator that died.  I do think
this is getting too detailed for the LKML list, but since you bring it
up, here is the analysis.  The user bash process exits with:

[31331] ERROR at dmtcpmessagetypes.cpp:62 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
     _magicBits = 
Message: read invalid message, _magicBits mismatch.  Did DMTCP coordinator die uncleanly?

This means that when the DMTCP coordinator died, it sent a message to the
checkpoint thread within the user process.  The message was ill-formed.
The current DMTCP code says that if a checkpoint thread receives an
ill-formed message from the coordinator, then it should die.  It's not
hard to change the protocol between DMTCP coordinator and checkpoint
thread of the user process into a more robust protocol with RETRY, further
ACK, etc.  We haven't done this.  Right now, the user simply restarts from
the last checkpoint.  If one process of a computation has been compromised
(either DMTCP coordinator or user process), then the whole computation
has been compromised.  I think in a previous version of DMTCP, the policy
was to allow the computation to continue when the coordinator dies.
Policies change.
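
As a rough illustration of the "more robust protocol" mentioned above, the
sketch below separates a closed connection (a short or empty read) from a
full-length message whose magic bytes are wrong, so the caller could retry
or reconnect in the first case. This is only an assumption about how such
a change might look, not DMTCP code: the MAGIC value, the header layout and
the names pack_message, parse_message, CoordinatorGone and BadMessage are
all hypothetical.

```python
import struct

MAGIC = b"DMTCP_SKETCH_MAGIC"      # hypothetical value, not DMTCP's
HEADER = struct.Struct("!18sI")    # 18 magic bytes, then a message type


class CoordinatorGone(Exception):
    """The peer closed the connection (short or empty read)."""


class BadMessage(Exception):
    """A full-length message arrived but its magic bytes do not match."""


def pack_message(msg_type):
    return HEADER.pack(MAGIC, msg_type)


def parse_message(data):
    # A short read is treated as a closed socket rather than as a
    # malformed message, so the caller can choose to retry or reconnect
    # instead of aborting the whole computation.
    if len(data) < HEADER.size:
        raise CoordinatorGone("connection closed after %d bytes" % len(data))
    magic, msg_type = HEADER.unpack(data[:HEADER.size])
    if magic != MAGIC:
        raise BadMessage("magic mismatch")
    return msg_type
```

Whether the process then continues or exits is still a policy decision;
the sketch only makes the two failure cases distinguishable.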

But I think you're missing the larger point.  We've developed DMTCP
over six years, largely with programmers who are much less experienced
than the kernel developers.  Yet DMTCP works reliably for many users.
I consider this a credit to the DMTCP design.  The Linux C/R design
is also excellent.

Can we get back to questions of design, using the implementations as
reference implementations?  If you don't object, I'll also skip replying
to the other post, since I think we're getting too detailed.  I'm having
trouble keeping up with the posts.  :-)  An offline discussion will
give us time to look more carefully at these issues, and draw more
careful conclusions.

Thanks,
- Gene

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found]                       ` <20101108162630.GN31077@sundance.ccs.neu.edu>
@ 2010-11-08 18:14                         ` Oren Laadan
  2010-11-08 18:37                           ` Gene Cooperman
  0 siblings, 1 reply; 49+ messages in thread
From: Oren Laadan @ 2010-11-08 18:14 UTC (permalink / raw)
  To: Gene Cooperman
  Cc: Kapil Arya, Tejun Heo, ksummit-2010-discuss, linux-kernel, hch,
	Linux Containers

Hi,

Ok, I'll bite the bullet for now - to be continued...

Just one important clarification:

>> Linux-cr can do live migration - e.g. VDI, move the desktop - in
>> which case skype's sockets' network stacks are reconstructed,
>> transparently to both skype (local apps) and the peer (remote apps).
>> Then, at the destination host, skype continues to work.
>
> That's a really cool thing to do, and it's definitely not part of what
> DMTCP does.  It might be possible to do userland live migration,
> but it's definitely not part of our current scope.  But if we're talking
> about live migration, have you also looked at the work of
> Andres Lagar Cavilla on SnowFlock?
>      http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf
> He does live migration of entire virtual machines, again with very
> small delay.  Of course, the issue for any type of live migration is that
> if the rate of dirtying pages is very high (e.g. HPC), then there is
> still a delay or slow response, due to page faults to a remote host.

VMware, Xen and KVM already do live migration. However, VMs
are a separate beast.

We are concerned about _application_ level c/r and migration
(complete containers or individual applications). Many proven
techniques from the VM world apply to our context too (in your
example, post-copy migration).

Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-08 18:14                         ` Oren Laadan
@ 2010-11-08 18:37                           ` Gene Cooperman
  2010-11-08 19:34                             ` Oren Laadan
  0 siblings, 1 reply; 49+ messages in thread
From: Gene Cooperman @ 2010-11-08 18:37 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Gene Cooperman, Kapil Arya, Tejun Heo, ksummit-2010-discuss,
	linux-kernel, hch, Linux Containers

Thanks for the careful response, Oren.  For others who read this,
one could interpret Oren's rapid post as criticizing the work of
Andres Lagar Cavilla.  I'm sure that this was not Oren's intention.
Please read below for a brief clarification of the novelty of SnowFlock.
    Anyway, I really look forward to the phone discussion.  I've also
enjoyed our interchange, for giving me an opportunity to explain more about
the DMTCP design.  Thank you.
                                                        Best wishes,
                                                        - Gene

On Mon, Nov 08, 2010 at 01:14:12PM -0500, Oren Laadan wrote:
> Hi,
> 
> Ok, I'll bite the bullet for now - to be continued...
> 
> Just one important clarification:
> 
> >>Linux-cr can do live migration - e.g. VDI, move the desktop - in
> >>which case skype's sockets' network stacks are reconstructed,
> >>transparently to both skype (local apps) and the peer (remote apps).
> >>Then, at the destination host, skype continues to work.
> >
> >That's a really cool thing to do, and it's definitely not part of what
> >DMTCP does.  It might be possible to do userland live migration,
> >but it's definitely not part of our current scope.  But if we're talking
> >about live migration, have you also looked at the work of
> >Andres Lagar Cavilla on SnowFlock?
> >     http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf
> >He does live migration of entire virtual machines, again with very
> >small delay.  Of course, the issue for any type of live migration is that
> >if the rate of dirtying pages is very high (e.g. HPC), then there is
> >still a delay or slow response, due to page faults to a remote host.
> 
> VMware, Xen and KVM already do live migration. However, VMs
> are a separate beast.

I absolutely agree with your point that live migration of
applications is a different beast, and technically very novel.
    Since I know Andres Lagar Cavilla personally, I also feel obligated
to comment on why SnowFlock truly is novel in the VM space.  First, as Andres
writes:
"SnowFlock is an open-source project [SnowFlock] built on the Xen 3.0.3
VMM [Barham 2003]."
In the abstract, Andres points out one of the major points of novelty:
"To evaluate SnowFlock, we focus on the demanding
scenario of services requiring on-the-fly creation of hundreds
of parallel workers in order to solve computationally-intensive
queries in seconds."
We must be careful that we don't destroy someone's reputation without
a careful study of their work.

> We are concerned about _application_ level c/r and migration
> (complete containers or individual applications). Many proven
> techniques from the VM world apply to our context too (in your
> example, post-copy migration).
> 
> Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-08 18:37                           ` Gene Cooperman
@ 2010-11-08 19:34                             ` Oren Laadan
  0 siblings, 0 replies; 49+ messages in thread
From: Oren Laadan @ 2010-11-08 19:34 UTC (permalink / raw)
  To: Gene Cooperman; +Cc: Kapil Arya, Tejun Heo, linux-kernel, hch, Linux Containers



On 11/08/2010 01:37 PM, Gene Cooperman wrote:
> Thanks for the careful response, Oren.  For others who read this,
> one could interpret Oren's rapid post as criticizing the work of
> Andres Lagar Cavilla.  I'm sure that this was not Oren's intention.
> Please read below for a brief clarification of the novelty of SnowFlock.

Err... yes, that was careless of me. I was too focused on
getting the thread back on track. Thanks for pointing that out.

>>> about live migration, have you also looked at the work of
>>> Andres Lagar Cavilla on SnowFlock?
>>>      http://andres.lagarcavilla.com/publications/LagarCavillaEurosys09.pdf
>>> He does live migration of entire virtual machines, again with very
>>> small delay.  Of course, the issue for any type of live migration is that
>>> if the rate of dirtying pages is very high (e.g. HPC), then there is
>>> still a delay or slow response, due to page faults to a remote host.
>>
>> VMware, Xen and KVM already do live migration. However, VMs
>> are a separate beast.
>
> I absolutely agree with your point that live migration of
> applications is a different beast, and technically very novel.
>      Since I know Andres Lagar Cavilla personally, I also feel obligated
> to comment on why SnowFlock truly is novel in the VM space.  First, as Andres
> writes:
> "SnowFlock is an open-source project [SnowFlock] built on the Xen 3.0.3
> VMM [Barham 2003]."
> In the abstract, Andres points out one of the major points of novelty:
> "To evaluate SnowFlock, we focus on the demanding
> scenario of services requiring on-the-fly creation of hundreds
> of parallel workers in order to solve computationally-intensive
> queries in seconds."
> We must be careful that we don't destroy someone's reputation without
> a careful study of their work.

Yes, it's really nice work - I saw it when I visited there.
(Coincidentally the post-copy idea with Xen appeared also in
VEE 09 briefly before).

Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-07 21:59                     ` Oren Laadan
@ 2010-11-17 11:57                       ` Tejun Heo
  2010-11-17 15:39                         ` Serge E. Hallyn
  2010-11-17 22:17                         ` Matt Helsley
  0 siblings, 2 replies; 49+ messages in thread
From: Tejun Heo @ 2010-11-17 11:57 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Gene Cooperman, Matt Helsley, Kapil Arya, ksummit-2010-discuss,
	linux-kernel, hch, Linux Containers

Hello, Oren.

On 11/07/2010 10:59 PM, Oren Laadan wrote:
> We could work to add ABIs and APIs for each and every possible piece
> of state that affects userspace. And for each we'll argue forever
> about the design and some time later regret that it wasn't designed
> correctly :p

I'm sorry but in-kernel CR already looks like a major misdesign to me.

> Even if that happens (which is very unlikely and unnecessary),
> it will generate all the very same code in the kernel that Tejun
> has been complaining about, and _more_. And we will still suffer
> from issues such as lack of atomicity and being unable to do many
> simple and advanced optimizations.

It may be harder but those will be localized for specific features
which would be useful for other purposes too.  With in-kernel CR,
you're adding a bunch of intrusive changes which can't be tested or
used apart from CR.

> Or we could use linux-cr for that: do the c/r in the kernel,
> keep the know-how in the kernel, expose (and commit to) a
> per-kernel-version ABI (not vow to keep countless new individual
> ABIs forever after getting them wrong...), be able to do all
> sorts of useful optimization and provide atomicity and guarantees
> (see under "leak detection" in the OLS linux-cr paper). Also,
> once the c/r infrastructure is in the kernel, it will be easy
> (and encouraged) to support newly introduced features.

And the only reason it seems easier is because you're working around
the ABI problem by declaring that these binary blobs wouldn't be kept
compatible between different kernel versions and configurations.  That
simply is the wrong approach.  If you want to export something, build
it properly into ABI.

> Finally, then we would use dmtcp as well as other tools on top
> of the kernel-cr - and I'm looking forward to doing that!

Yeah, this part I agree with.  The higher level workarounds implemented in
dmtcp are quite impressive and useful no matter what happens to lower
layer.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-17 11:57                       ` Tejun Heo
@ 2010-11-17 15:39                         ` Serge E. Hallyn
  2010-11-17 15:46                           ` Tejun Heo
  2010-11-17 22:17                         ` Matt Helsley
  1 sibling, 1 reply; 49+ messages in thread
From: Serge E. Hallyn @ 2010-11-17 15:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel,
	Matt Helsley, Linux Containers, Eric W. Biederman, xemul

Quoting Tejun Heo (tj@kernel.org):
> Hello, Oren.
> 
> On 11/07/2010 10:59 PM, Oren Laadan wrote:
> > We could work to add ABIs and APIs for each and every possible piece
> > of state that affects userspace. And for each we'll argue forever
> > about the design and some time later regret that it wasn't designed
> > correctly :p
> 
> I'm sorry but in-kernel CR already looks like a major misdesign to me.

By this do you mean the very idea of having CR support in the kernel?
Or our design of it in the kernel?  Let's go back to July 2008, at the
containers mini-summit, where it was unanimously agreed upon that the
kernel was the right place (Checkpoint/Restart [CR] under
http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that
we would start by supporting a single task with no resources.  Was that
whole discussion effectively misguided, in your opinion?  Or do you
feel that since the first steps outlined in that discussion we've
either "gone too far" or strayed in the subsequent design?

-serge

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-17 15:39                         ` Serge E. Hallyn
@ 2010-11-17 15:46                           ` Tejun Heo
  2010-11-18  9:13                             ` Pavel Emelyanov
                                               ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Tejun Heo @ 2010-11-17 15:46 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel,
	Matt Helsley, Linux Containers, Eric W. Biederman, xemul

Hello, Serge.

On 11/17/2010 04:39 PM, Serge E. Hallyn wrote:
>> I'm sorry but in-kernel CR already looks like a major misdesign to me.
> 
> By this do you mean the very idea of having CR support in the kernel?
> Or our design of it in the kernel?

The former, I'm afraid.

> Let's go back to July 2008, at the containers mini-summit, where it
> was unanimously agreed upon that the kernel was the right place
> (Checkpoint/Restart [CR] under
> http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that
> we would start by supporting a single task with no resources.  Was
> that whole discussion effectively misguided, in your opinion?  Or do
> you feel that since the first steps outlined in that discussion
> we've either "gone too far" or strayed in the subsequent design?

The conclusion doesn't seem like such a good idea, well, at least to
me for what it's worth.  Conclusions at summits don't carry decisive
weight.  It'll still have to prove its worthiness for mainline all the
same and in light of the already working userland alternative and the
expanded area now covered by virtualization, the arguments in this
thread don't seem too strong.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-17 11:57                       ` Tejun Heo
  2010-11-17 15:39                         ` Serge E. Hallyn
@ 2010-11-17 22:17                         ` Matt Helsley
       [not found]                           ` <20101117221713.GA27736-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  1 sibling, 1 reply; 49+ messages in thread
From: Matt Helsley @ 2010-11-17 22:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oren Laadan, Gene Cooperman, Matt Helsley, Kapil Arya,
	ksummit-2010-discuss, linux-kernel, hch, Linux Containers

On Wed, Nov 17, 2010 at 12:57:40PM +0100, Tejun Heo wrote:
> Hello, Oren.
> 
> On 11/07/2010 10:59 PM, Oren Laadan wrote:

<snip>

> 
> > Even if that happens (which is very unlikely and unnecessary),
> > it will generate all the very same code in the kernel that Tejun
> > has been complaining about, and _more_. And we will still suffer
> > from issues such as lack of atomicity and being unable to do many
> > simple and advanced optimizations.
> 
> It may be harder but those will be localized for specific features
> which would be useful for other purposes too.  With in-kernel CR,
> you're adding a bunch of intrusive changes which can't be tested or
> used apart from CR.

You seem to be arguing "Z is only testable/useful for doing the things Z
was made for". I couldn't agree more with that. CR is useful for:

	Fault-tolerance (typical HPC)
	Load-balancing (less-typical HPC)
	Debugging (simple [e.g. instead of coredumps] or complex
		time-reversible)
	Embedded devices that need to deal with persistent low-memory
		situations.

I think Oren's Kernel Summit presentation succinctly summarized these:
	http://www.cs.columbia.edu/~orenl/talks/ksummit-2010.pdf

My personal favorite idea (that hasn't been implemented yet) is an
application startup cache. I've been wondering if caching bash startup
after all the shared libraries have been searched, loaded, and linked
couldn't save a bunch of time spent in shell scripts. Post-link actually
seems like a checkpoint in application startup which would be generally
useful too. Of course you'd want to flush [portions of] the cache when
packages get upgraded/removed or shell PATHs change and the caches
would have to be per-user.
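
A minimal sketch of how the invalidation rules above could be folded into
a cache key, so that a changed PATH, a different user or an upgraded
library selects a different cached image. Nothing like this exists in
linux-cr; the function name startup_cache_key and its parameters are
hypothetical, and a real implementation would also need some way to learn
which libraries were actually loaded.

```python
import hashlib


def startup_cache_key(user, path_env, libraries):
    """Derive a cache key for a post-link startup image.

    libraries is an iterable of (library path, version or mtime) pairs.
    """
    h = hashlib.sha256()
    # Per-user caches and PATH changes: both feed into the key.
    for part in (user, path_env):
        h.update(part.encode("utf-8") + b"\0")
    # Package upgrades change a library's version (or mtime), which
    # changes the key; sorting makes the key independent of load order.
    for lib, version in sorted(libraries):
        h.update(("%s=%s" % (lib, version)).encode("utf-8") + b"\0")
    return h.hexdigest()
```

With a key like this, flushing portions of the cache reduces to dropping
entries whose key can no longer be produced.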

I'm less confident but still curious about caching after running rc
scripts (less confident because it would depend highly on the content
of the rc scripts). A scripted boot, for example, might be able to save
some time if the same rc scripts are run and they don't vary over time.
That in turn might be useful for carefully-tuned boots on embedded devices.

That said we don't currently have code for application caching. Yet we
can't be expected to write tools for every possible use of our API in
order to show just how true your tautology is.

> 
> > Or we could use linux-cr for that: do the c/r in the kernel,
> > keep the know-how in the kernel, expose (and commit to) a
> > per-kernel-version ABI (not vow to keep countless new individual

Oren, that statement might be read to imply that it's based on
something as useless as kernel version numbers. Arnd has pointed out in the
past how unsuitable that is and I tend to agree. There are at least two
possible things we can relate it to: the SHA of the compiled kernel tree
(which doesn't quite work because it assumes everybody uses git trees :( ),
or perhaps the SHA/hash of the cpp-processed checkpoint_hdr.h. We could
also stuff that header into the kernel (much like kconfigs are output from
/proc) for programs that want the kernel to describe the ABI to them.
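
The second option (hashing the cpp-processed header) could look roughly
like the sketch below. It assumes the preprocessed text is already
available as a string; the function name abi_fingerprint and the choice to
drop cpp line markers and whitespace differences are assumptions made for
illustration, not part of linux-cr.

```python
import hashlib


def abi_fingerprint(preprocessed_header):
    """Hash preprocessed header text, ignoring layout-only differences."""
    tokens = []
    for line in preprocessed_header.splitlines():
        stripped = line.strip()
        # Skip blank lines and cpp line markers such as '# 1 "file.h"',
        # which vary with build paths but do not affect the ABI.
        if not stripped or stripped.startswith("#"):
            continue
        tokens.extend(stripped.split())
    return hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()
```

Two trees that produce the same structure definitions would then report
the same fingerprint regardless of kernel version number, while adding or
reordering a field would change it.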

> > ABIs forever after getting them wrong...), be able to do all
> > sorts of useful optimization and provide atomicity and guarantees
> > (see under "leak detection" in the OLS linux-cr paper). Also,
> > once the c/r infrastructure is in the kernel, it will be easy
> > (and encouraged) to support newly introduced features.
> 
> And the only reason it seems easier is because you're working around
> the ABI problem by declaring that these binary blobs wouldn't be kept
> compatible between different kernel versions and configurations.  That

Not true. First of all, if you look at checkpoint_hdr.h, the contents and
layout of the structs don't vary according to kernel configurations.
Secondly, we have taken measures to increase the likelihood that the
structures will remain compatible. We've designed them to lay out the
same on 32-bit and 64-bit variants of an arch. We add to the end of the
structs. We use an explicit length field in a header to each section
to ensure that changes in the size of the structures don't necessarily
break compatibility.

That said, yes, these measures don't absolutely preclude incompatibility.
They will however make compatibility more likely.
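
To make the two measures above concrete (appending fields only at the end
of a struct, and an explicit length in each section header), here is a
small sketch of a reader that keeps working when a newer writer has
appended a field. The layouts SECTION_HDR, TASK_V1 and TASK_V2 and the
function names are hypothetical and much simpler than what
checkpoint_hdr.h actually defines.

```python
import struct

SECTION_HDR = struct.Struct("<II")   # (type, len); hypothetical layout
TASK_V1 = struct.Struct("<II")       # pid, state
TASK_V2 = struct.Struct("<III")      # pid, state, plus one appended field


def write_section(sec_type, payload):
    return SECTION_HDR.pack(sec_type, len(payload)) + payload


def read_sections(image):
    """Yield (type, payload), using each header's length to advance."""
    offset = 0
    while offset < len(image):
        sec_type, length = SECTION_HDR.unpack_from(image, offset)
        offset += SECTION_HDR.size
        yield sec_type, image[offset:offset + length]
        offset += length


def read_task_v1(payload):
    # An older reader unpacks only the prefix it knows about; bytes
    # appended by a newer writer are skipped via the header length.
    return TASK_V1.unpack_from(payload, 0)
```

A V1 reader walking an image that mixes V1 and V2 task sections still
recovers pid and state from both, because it never relies on the payload
size it was compiled with.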

Then there's the fact that structures like siginfo (for example) so rarely
change because they're already part of an ABI. That in turn means that the
corresponding checkpoint ABI rarely changes (we don't reuse the existing
struct because that would require compat-syscall-style code).

Most of the time, in fact, the fields we output are there only because
they reflect the 'model' of how things work that the kernel presents to
userspace. That model also rarely changes (we've never gotten rid of the
POSIX concept of process groups in one extreme example). Perhaps the 
closest thing we have to wholly-kernel-internal data structures are the
signal/sighand structs which echo the way these fields are split from the
task struct and shared between tasks. Though I'd argue that gets back into
the 'model' presented to userspace (via fork/clone) anyway...

I'd estimate that the biggest 'model' changes have come via various
filesystem interfaces over the years. We don't checkpoint tasks with open
sysfs, /proc, or debugfs files (for example) so that's not part of our
ABI and we don't intend to make it so.

Nor do we output wholly kernel-internal structures and fields that are
often chosen for their performance benefits (e.g. rbtrees, linked lists,
hash tables, idrs, various caches, locks, RCU heads, refcounts, etc). So
the kernel is free to change implementations without affecting our ABI.

The compatibility has natural limits. For instance we can't ever
restart an x86_64 binary on a 32-bit kernel. If you add a new syscall
interface (e.g. fanotify) then you can't use a checkpoint of a task that
makes use of it on fanotify-disabled kernels. Yet these limitations exist
no matter where or how you choose to implement checkpoint/restart.

We've made almost every effort at making this a proper ABI (I say
'almost' because we still need to export a description of it at runtime
and we need to do something better in place of the logfd output). Still,
the essentials of a proper checkpoint/restart ABI are already there.

Cheers,
	-Matt Helsley

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-17 15:46                           ` Tejun Heo
@ 2010-11-18  9:13                             ` Pavel Emelyanov
       [not found]                               ` <4CE4EE21.6050305-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2010-11-18 19:53                             ` Oren Laadan
  2010-11-19  4:10                             ` Serge Hallyn
  2 siblings, 1 reply; 49+ messages in thread
From: Pavel Emelyanov @ 2010-11-18  9:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge E. Hallyn, Oren Laadan, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Matt Helsley, Linux Containers,
	Eric W. Biederman

On 11/17/2010 06:46 PM, Tejun Heo wrote:
> Hello, Serge.
> 
> On 11/17/2010 04:39 PM, Serge E. Hallyn wrote:
>>> I'm sorry but in-kernel CR already looks like a major misdesign to me.
>>
>> By this do you mean the very idea of having CR support in the kernel?
>> Or our design of it in the kernel?
> 
> The former, I'm afraid.

Can you elaborate on this please?

>> Let's go back to July 2008, at the containers mini-summit, where it
>> was unanimously agreed upon that the kernel was the right place
>> (Checkpoint/Restart [CR] under
>> http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that
>> we would start by supporting a single task with no resources.  Was
>> that whole discussion effectively misguided, in your opinion?  Or do
>> you feel that since the first steps outlined in that discussion
>> we've either "gone too far" or strayed in the subsequent design?
> 
> The conclusion doesn't seem like such a good idea, well, at least to
> me for what it's worth.  Conclusions at summits don't carry decisive
> weight.  It'll still have to prove its worthiness for mainline all the
> same and in light of the already working userland alternative and the
> expanded area now covered by virtualization, the arguments in this
> thread don't seem too strong.
> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found]                               ` <4CE4EE21.6050305-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-11-18  9:48                                 ` Tejun Heo
  2010-11-18 20:13                                   ` Jose R. Santos
  2010-11-19  3:54                                   ` Serge Hallyn
  0 siblings, 2 replies; 49+ messages in thread
From: Tejun Heo @ 2010-11-18  9:48 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Kapil Arya, Gene Cooperman, linux-kernel, Eric W. Biederman,
	Linux Containers, Serge E. Hallyn

Hello, Pavel.

On 11/18/2010 10:13 AM, Pavel Emelyanov wrote:
>>> By this do you mean the very idea of having CR support in the kernel?
>>> Or our design of it in the kernel?
>>
>> The former, I'm afraid.
> 
> Can you elaborate on this please?

I think I already did that several times in this thread but here's an
attempt at summary.

* It adds a bunch of pseudo ABI when most of the same information is
  available via already established ABI.

* In a way which can only ever be used and tested by CR.  If possible,
  kernel should provide generic mechanisms which can be used to
  implement features in userland.  One of the reasons why we'd like to
  export small basic building blocks instead of full end-to-end
  solutions from the kernel is that we don't know how things will
  change in the future.  In-kernel CR puts too much in the kernel in a
  way too inflexible manner.

* It essentially adds a separate complete set of entry/exit points for
  a lot of things, which makes things more error prone and increases
  maintenance overhead across the board.

* And, most of all, there are userland implementation and
  virtualization, making the benefit to overhead ratio completely off.
  Userland implementation _already_ achieves most of what's necessary
  for the most important use case of HPC without any special help from
  the kernel.  The only reasonable thing to do is taking a good look
  at it and finding ways to improve it.
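
As an illustration of the "already established ABI" point (this sketch
is not part of the original thread): on Linux, the file offset and open
flags of a descriptor can be read back from /proc/self/fdinfo/<fd>, and
the path from /proc/self/fd/<fd>.  The helper names read_fdinfo() and
describe_open_file() are made up for this example, and it covers only
the simple case of a regular file.

```python
import os
import tempfile


def read_fdinfo(fd):
    """Return the key/value fields of /proc/self/fdinfo/<fd> as a dict."""
    fields = {}
    with open(f"/proc/self/fdinfo/{fd}") as f:
        for line in f:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields


def describe_open_file(fd):
    """Collect what a userland checkpointer could use to reopen fd later."""
    info = read_fdinfo(fd)
    return {
        "path": os.readlink(f"/proc/self/fd/{fd}"),
        "pos": int(info["pos"]),
        "flags": int(info["flags"], 8),  # the kernel prints flags in octal
    }


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello world")
        tmp.flush()
        os.lseek(tmp.fileno(), 5, os.SEEK_SET)
        print(describe_open_file(tmp.fileno()))
```

State that has no such view (the thread mentions inotify watches as one
case) is where new interfaces would be needed under this approach.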

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found]                           ` <20101117221713.GA27736-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
@ 2010-11-18 10:06                             ` Tejun Heo
  2010-11-18 20:25                             ` Oren Laadan
  1 sibling, 0 replies; 49+ messages in thread
From: Tejun Heo @ 2010-11-18 10:06 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Kapil Arya, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	ksummit-2010-discuss-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Linux Containers, hch-jcswGhMUV9g

Hello, Matt.

On 11/17/2010 11:17 PM, Matt Helsley wrote:
>> It may be harder but those will be localized for specific features
>> which would be useful for other purposes too.  With in-kernel CR,
>> you're adding a bunch of intrusive changes which can't be tested or
>> used apart from CR.
> 
> You seem to be arguing "Z is only testable/useful for doing the things Z
> was made for". I couldn't agree more with that. CR is useful for:

I'm saying it's way too narrowly scoped and inflexible to be a kernel
feature.  Kernel features should be like the basic tools, you know,
hammers, saws, drills and stuff.  In-kernel CR is more like an
overcomplicated food processor which usually sits in the top drawer
after the first several runs,

> 	Fault-tolerance (typical HPC)
> 	Load-balancing (less-typical HPC)
> 	Debugging (simple [e.g. instead of coredumps] or complex
> 		time-reversible)
> 	Embedded devices that need to deal with persistent low-memory
> 		situations.

which can do all of the above, a lot of which can be achieved in a
less messy way than putting the whole thing inside the kernel.

> My personal favorite idea (that hasn't been implemented yet) is an
> application startup cache. I've been wondering if caching bash startup
> after all the shared libraries have been searched, loaded, and linked
> couldn't save a bunch of time spent in shell scripts. Post-link actually
> seems like a checkpoint in application startup which would be generally
> useful too. Of course you'd want to flush [portions of] the cache when
> packages get upgraded/removed or shell PATHs change and the caches
> would have to be per-user.

What does that have to do with the kernel?  If you want a post-link
cache, implement it in ld.so where it belongs.  That's like using a
food processor to mix cement.

> I'm less confident but still curious about caching after running rc
> scripts (less confident because it would depend highly on the content
> of the rc scripts). A scripted boot, for example, might be able to save
> some time if the same rc scripts are run and they don't vary over time.
> That in turn might be useful for carefully-tuned boots on embedded devices.
> 
> That said we don't currently have code for application caching. Yet we
> can't be expected to write tools for every possible use of our API in
> order to show just how true your tautology is.

Continuing the same line of thought.  It _CAN_ be used to do that in a
convoluted way but there are better ways to solve those problems.

> Most of the time, in fact, the fields we output are there only because
> they reflect the 'model' of how things work that the kernel presents to
> userspace. That model also rarely changes (we've never gotten rid of the
> POSIX concept of process groups in one extreme example). Perhaps the 
> closest thing we have to wholly-kernel-internal data structures are the
> signal/sighand structs which echo the way these fields are split from the
> task struct and shared between tasks. Though I'd argue that gets back into
> the 'model' presented to userspace (via fork/clone) anyway...

Yeah, exactly, so just do it inside the established ABI extending
where it makes sense.  No reason to add a whole separate set.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-17 15:46                           ` Tejun Heo
  2010-11-18  9:13                             ` Pavel Emelyanov
@ 2010-11-18 19:53                             ` Oren Laadan
  2010-11-19  4:10                             ` Serge Hallyn
  2 siblings, 0 replies; 49+ messages in thread
From: Oren Laadan @ 2010-11-18 19:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge E. Hallyn, Kapil Arya, Gene Cooperman, linux-kernel,
	Matt Helsley, Linux Containers, Eric W. Biederman, xemul



On 11/17/2010 10:46 AM, Tejun Heo wrote:
> Hello, Serge.
> 
> On 11/17/2010 04:39 PM, Serge E. Hallyn wrote:
>>> I'm sorry but in-kernel CR already looks like a major misdesign to me.
>>
>> By this do you mean the very idea of having CR support in the kernel?
>> Or our design of it in the kernel?
> 
> The former, I'm afraid.
> 
>> Let's go back to July 2008, at the containers mini-summit, where it
>> was unanimously agreed upon that the kernel was the right place
>> (Checkpoint/Restart [CR] under
>> http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that
>> we would start by supporting a single task with no resources.  Was
>> that whole discussion effectively misguided, in your opinion?  Or do
>> you feel that since the first steps outlined in that discussion
>> we've either "gone too far" or strayed in the subsequent design?
> 
> The conclusion doesn't seem like such a good idea, well, at least to
> me for what it's worth.  Conclusions at summits don't carry decisive
> weight.  It'll still have to prove its worthiness for mainline all the
> same and in light of already working userland alternative and the
> expanded area now covered by virtualization, the arguments in this
> thread don't seem too strong.

While it's your opinion that userland alternatives "already work",
in reality they are unsuitable for several real use-cases. The
userland approach has serious restrictions - which I will cover
in a follow-up post to my discussion with Gene soon.

Note that one important point of agreement was that DMTCP's ability
to provide "glue" to restart applications without their original
context is _orthogonal_ to how the core c/r is done. IOW, those
exciting goodies from DMTCP are useful with either form of c/r.

You also argue that "virtualization" (VMs?) covers everything else,
implying that lightweight virtualization is useless. In reality it
is an important technology, already in the kernel (surely you don't
suggest pulling it out?!) and for a reason. That is already a very
good reason to provide, e.g. containers c/r and live-migration to
keep it competitive and useful.

Thanks,

Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-18  9:48                                 ` Tejun Heo
@ 2010-11-18 20:13                                   ` Jose R. Santos
  2010-11-19  3:54                                   ` Serge Hallyn
  1 sibling, 0 replies; 49+ messages in thread
From: Jose R. Santos @ 2010-11-18 20:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Serge E. Hallyn, Oren Laadan, Kapil Arya,
	Gene Cooperman, linux-kernel@vger.kernel.org, Matt Helsley,
	Linux Containers, Eric W. Biederman

On Thu, 18 Nov 2010 10:48:34 +0100
Tejun Heo <tj@kernel.org> wrote:

> Hello, Pavel.
> 
> On 11/18/2010 10:13 AM, Pavel Emelyanov wrote:
> >>> By this do you mean the very idea of having CR support in the
> >>> kernel? Or our design of it in the kernel?
> >>
> >> The former, I'm afraid.
> > 
> > Can you elaborate on this please?
> 
> I think I already did that several times in this thread but here's an
> attempt at summary.

Yet the arguments seem to be vague enough not to be convincing to the
people working on the code.

> * It adds a bunch of pseudo ABI when most of the same information is
>   available via already established ABI.

Can you elaborate on this?  What established ABI are you proposing we
use here.  Hopefully we can turn this into a more technical discussion. 
 
> * In a way which can only ever be used and tested by CR.  If possible,

So what if it can only be tested with CR as long as we can make CR work
on a variety of environments?  Scalability changes for _really_ large
SMP boxes can only be reliably tested by people with such equipment.
We are not imposing any such restriction and this code can be tested on
a very wide range of setups.

>   kernel should provide generic mechanisms which can be used to
>   implement features in userland.  One of the reasons why we'd like to
>   export small basic building blocks instead of full end-to-end
>   solutions from the kernel is that we don't know how things will
>   change in the future.  In-kernel CR puts too much in the kernel in a
>   way too inflexible manner.
> 
> * It essentially adds a separate complete set of entry/exit points for
>   a lot of things, which makes things more error prone and increases
>   maintenance overhead across the board.

I partially agree with you here.  There will be maintenance overhead
every time you add code to the kernel that _may_ make changes in the
future more complicated.  This is true for _any_ code that is added to
the core kernel.  Now in my experience such maintenance burden is most
disruptive when the code being added creates a lot of new state that
needs to be tracked in multiple places unrelated to CR (in this case).
Our argument is that the CR code is not creating new state that will
cause painful future changes to the kernel.  If you have specific
examples that you are concerned with, great.  Let's discuss those.

Are we promising zero maintenance cost?  No.  But guess what, neither
do most features that make it into the kernel.

Now, if we change the argument around...  What would be the maintenance
cost of keeping this outside the kernel?  I would argue that it is much
higher and would use SystemTap as the first example that comes to mind.

> * And, most of all, there are userland implementation and
>   virtualization, making the benefit to overhead ratio completely off.

Can we keep virtualization out of this?  Every time someone mentions
virtualization as a solution, it makes me feel like these people just
don't understand the problem we are trying to solve.  It is just not
practical to create a new VM for every application you want to CR.
These are two different tools to attack two different problems.

>   Userland implementation _already_ achieves most of what's necessary
>   for the most important use case of HPC without any special help from

What are these _most_ important cases of HPC that you are referring to?
Can we do a lot of these cases from userspace? Sure, but why are the
ones that can't be done from userspace any less important?  If nobody
cared about those, we would not be having this conversation.

>   the kernel.  The only reasonable thing to do is taking a good look
>   at it and finding ways to improve it.

The userspace vs in-kernel discussion has been done before as multiple
people have already said in this thread.  Show me a version of userspace
CR that can correctly do all that an in-kernel implementation is capable
of.

> Thanks.
> 

-- 
Jose R. Santos

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found]                           ` <20101117221713.GA27736-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
  2010-11-18 10:06                             ` Tejun Heo
@ 2010-11-18 20:25                             ` Oren Laadan
  1 sibling, 0 replies; 49+ messages in thread
From: Oren Laadan @ 2010-11-18 20:25 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Kapil Arya, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	ksummit-2010-discuss-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Tejun Heo, Linux Containers, hch-jcswGhMUV9g



On 11/17/2010 05:17 PM, Matt Helsley wrote:
> On Wed, Nov 17, 2010 at 12:57:40PM +0100, Tejun Heo wrote:
>> Hello, Oren.
>>
>> On 11/07/2010 10:59 PM, Oren Laadan wrote:

<snip>

>>> Or we could use linux-cr for that: do the c/r in the kernel,
>>> keep the know-how in the kernel, expose (and commit to) a
>>> per-kernel-version ABI (not vow to keep countless new individual
> 
> Oren, that statement might be read to imply that it's based on
> something as useless as kernel version numbers. Arnd has pointed out in the
> past how unsuitable that is and I tend to agree. There are at least two
> possible things we can relate it to: the SHA of the compiled kernel tree
> (which doesn't quite work because it assumes everybody uses git trees :( ),
> or perhaps the SHA/hash of the cpp-processed checkpoint_hdr.h. We could
> also stuff that header into the kernel (much like kconfigs are output from
> /proc) for programs that want the kernel to describe the ABI to them.
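
A rough sketch of the second option Matt mentions, hashing the
cpp-processed header so that the checkpoint image format is identified
by its content rather than by a kernel version number.  This is not
code from linux-cr: the function name abi_fingerprint(), the choice of
SHA-1, the whitespace normalization and the sample struct are all
assumptions made for illustration.

```python
import hashlib


def abi_fingerprint(preprocessed_header: str) -> str:
    """Hash preprocessed header text, ignoring layout-only differences."""
    kept = [
        line
        for line in preprocessed_header.splitlines()
        # Drop cpp linemarkers such as:  # 1 "checkpoint_hdr.h"
        if not line.lstrip().startswith("#")
    ]
    # Collapse every run of whitespace (including newlines) to one space.
    normalized = " ".join(" ".join(kept).split())
    return hashlib.sha1(normalized.encode()).hexdigest()


if __name__ == "__main__":
    # Hypothetical header contents, used only to exercise the function.
    v1 = '# 1 "hdr.h"\nstruct ckpt_hdr_example {\n  int type;\n  int len;\n};\n'
    v2 = "struct ckpt_hdr_example { int type; int len; };"
    v3 = "struct ckpt_hdr_example { int type; int len; int flags; };"
    print(abi_fingerprint(v1) == abi_fingerprint(v2))  # layout change only
    print(abi_fingerprint(v1) == abi_fingerprint(v3))  # field added
```

A restart tool could compare the fingerprint stored in an image with
the one reported by the running kernel and refuse (or convert) on a
mismatch.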

BTW, it's the same for userspace c/r: for the same set of features,
the format (ABI) remains unchanged. Adding features breaks this and
a new version is necessary, and conversion from old to new will be
needed.

Moreover, supporting a new feature in userspace means adding the
proper API/ABI in the kernel, including refactoring etc, which is
even harder than adding the support for it in linux-cr.

Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-18  9:48                                 ` Tejun Heo
  2010-11-18 20:13                                   ` Jose R. Santos
@ 2010-11-19  3:54                                   ` Serge Hallyn
  1 sibling, 0 replies; 49+ messages in thread
From: Serge Hallyn @ 2010-11-19  3:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Eric W. Biederman, Linux Containers

Quoting Tejun Heo (tj@kernel.org):
> * And, most of all, there are userland implementation and
>   virtualization, making the benefit to overhead ratio completely off.
>   Userland implementation _already_ achieves most of what's necessary

Guess I'll just be offensive here and say, straight-out:  I don't
believe it.  Can I see the userspace implementation of c/r?

If it's as good as the kernel level c/r, then awesome - we don't
need the kernel patches.

If it's not as good, then the thing is, we're not drawing arbitrary
lines saying "is this good enough", rather we want completely
reliable and transparent c/r.  IOW, the running task and the other
end can't tell that a migration happened, and, if checkpoint says
it worked, then restart must succeed.

-serge

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-17 15:46                           ` Tejun Heo
  2010-11-18  9:13                             ` Pavel Emelyanov
  2010-11-18 19:53                             ` Oren Laadan
@ 2010-11-19  4:10                             ` Serge Hallyn
  2010-11-19 14:04                               ` Tejun Heo
  2 siblings, 1 reply; 49+ messages in thread
From: Serge Hallyn @ 2010-11-19  4:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul, Linux Containers,
	Eric W. Biederman

Quoting Tejun Heo (tj@kernel.org):
> Hello, Serge.

Hey Tejun  :)

> On 11/17/2010 04:39 PM, Serge E. Hallyn wrote:
> >> I'm sorry but in-kernel CR already looks like a major misdesign to me.
> > 
> > By this do you mean the very idea of having CR support in the kernel?
> > Or our design of it in the kernel?
> 
> The former, I'm afraid.
> 
> > Let's go back to July 2008, at the containers mini-summit, where it
> > was unanimously agreed upon that the kernel was the right place
> > (Checkpoint/Restart [CR] under
> > http://wiki.openvz.org/Containers/Mini-summit_2008_notes ), and that
> > we would start by supporting a single task with no resources.  Was
> > that whole discussion effectively misguided, in your opinion?  Or do
> > you feel that since the first steps outlined in that discussion
> > we've either "gone too far" or strayed in the subsequent design?
> 
> The conclusion doesn't seem like such a good idea, well, at least to
> me for what it's worth.  Conclusions at summits don't carry decisive
> weight.

Of course.  It allows us to present at kernel summit and look for early
rejections to save us all some time (which we did, at the container
mini-summit readout at ksummit 2008), but it would be silly to read
anything more into it than that.

> It'll still have to prove its worthiness for mainline all the
> same

100% agreed.

> and in light of already working userland alternative and the

Here's where we disagree.  If you are right about a viable userland
alternative ('already working' isn't even a prereq in my opinion,
so long as it is really viable), then I'm with you, but I'm not buying
it at this point.

Seriously.  Truly.  Honestly.  I am *not* looking for any extra kernel
work at this moment, if we can help it in any way.

> expanded area now covered by virtualization, the arguments in this
> thread don't seem too strong.

-serge

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19  4:10                             ` Serge Hallyn
@ 2010-11-19 14:04                               ` Tejun Heo
  2010-11-20 18:05                                 ` Oren Laadan
       [not found]                                 ` <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 2 replies; 49+ messages in thread
From: Tejun Heo @ 2010-11-19 14:04 UTC (permalink / raw)
  To: Serge Hallyn
  Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul, Linux Containers,
	Eric W. Biederman

On 11/19/2010 05:10 AM, Serge Hallyn wrote:
> Hey Tejun  :)

Hey, :-)

>> and in light of already working userland alternative and the
> 
> Here's where we disagree.  If you are right about a viable userland
> alternative ('already working' isn't even a prereq in my opinion,
> so long as it is really viable), then I'm with you, but I'm not buying
> it at this point.
> 
> Seriously.  Truly.  Honestly.  I am *not* looking for any extra kernel
> work at this moment, if we can help it in any way.

What's so wrong with Gene's work?  Sure, it has some hacky aspects but
let's fix those up.  To me, it sure looks like much saner and
manageable approach than in-kernel CR.  We can add nested ptrace,
CLONE_SET_PID (or whatever) in pidns, integrate it with various ns
supports, add an ability to adjust brk, export inotify state via
fdinfo and so on.

The thing is already working, the codebase of core part is fairly
small and condor is contemplating integrating it, so at least some
people in HPC segment think it's already viable.  Maybe the HPC
cluster I'm currently sitting near is special case but people here
really don't run very fancy stuff.  In most cases, they're fairly
simple (from system POV) C programs reading/writing data and burning a
_LOT_ of CPU cycles in between and admins here seem to think dmtcp
integrated with condor would work well enough for them.

Sure, in-kernel CR has better or more reliable coverage now but by how
much?  The basic things are already there in userland.  The tradeoff
simply doesn't make any sense.  If it were a well separated self
sustained feature, it probably would be able to get in, but it's all
over the place and requires a completely new concept - the
quasi-ABI'ish binary blob which would probably be portable across
different kernel versions with some massaging.  I personally think the
idea is fundamentally flawed (just go through the usual ABI!) but even
if it were not it would require _MUCH_ stronger rationale than it
currently has to be even considered for mainline inclusion.

Maybe it's just me but most of the arguments for in-kernel CR look
very weak.  They're either about remote toy use cases or along the
line that userland CR currently doesn't do everything kernel CR does
(yet).  Even if it weren't for me, I frankly can't see how it would be
included in mainline.

I think it would be best for everyone to improve userland CR.  A lot
of knowledge and experience gained through kernel CR would be
applicable and won't go wasted.  Strong resistance against direction
change certainly is understandable but IMHO pushing the current
direction would only increase loss.  I of course could be completely
wrong and might end up getting mails filled up with megabytes of "told
you so" later, but, well, at this point, in-kernel CR already looks
half dead to me.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found]                                 ` <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2010-11-19 14:36                                   ` Kirill Korotaev
       [not found]                                     ` <04F4899E-B5C7-4BAF-8F2F-05D507A91408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
  2010-11-20 18:08                                   ` Oren Laadan
  2010-11-20 18:11                                   ` Oren Laadan
  2 siblings, 1 reply; 49+ messages in thread
From: Kirill Korotaev @ 2010-11-19 14:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kapil Arya, Pavel Emelianov, Gene Cooperman,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Eric W. Biederman, Linux Containers

Tejun,

Sorry for getting into the middle of the discussion, but...

Can you imagine how many userland APIs are needed to make userspace C/R?

Do you really want APIs in user-space which allow one to:
- send signals with siginfo attached (kill() doesn't work...)
- read inotify configuration
- insert SKB's into socket buffers
- setup all TCP/IP parameters for sockets
- wait for AIO pending in other processes
- setting different statistics counters (like netdev stats etc.)
and so on...

For every small piece of functionality you will need to export ABI and maintain it forever.
It's thousands of APIs! And why the hell are they needed in user space at all?

BTW, the HPC case you are talking about is probably the simplest one. Last time I looked into it, IBM Meiosis c/r
didn't even bother with TTY migration. In OpenVZ we really do need much more than that, like
autofs/NFS support, preserved statistics, TTYs, etc. etc. etc.

Thanks,
Kirill

On Nov 19, 2010, at 17:04 , Tejun Heo wrote:

> On 11/19/2010 05:10 AM, Serge Hallyn wrote:
>> Hey Tejun  :)
> 
> Hey, :-)
> 
>>> and in light of already working userland alternative and the
>> 
>> Here's where we disagree.  If you are right about a viable userland
>> alternative ('already working' isn't even a prereq in my opinion,
>> so long as it is really viable), then I'm with you, but I'm not buying
>> it at this point.
>> 
>> Seriously.  Truly.  Honestly.  I am *not* looking for any extra kernel
>> work at this moment, if we can help it in any way.
> 
> What's so wrong with Gene's work?  Sure, it has some hacky aspects but
> let's fix those up.  To me, it sure looks like much saner and
> manageable approach than in-kernel CR.  We can add nested ptrace,
> CLONE_SET_PID (or whatever) in pidns, integrate it with various ns
> supports, add an ability to adjust brk, export inotify state via
> fdinfo and so on.
> 
> The thing is already working, the codebase of core part is fairly
> small and condor is contemplating integrating it, so at least some
> people in HPC segment think it's already viable.  Maybe the HPC
> cluster I'm currently sitting near is special case but people here
> really don't run very fancy stuff.  In most cases, they're fairly
> simple (from system POV) C programs reading/writing data and burning a
> _LOT_ of CPU cycles in between and admins here seem to think dmtcp
> integrated with condor would work well enough for them.
> 
> Sure, in-kernel CR has better or more reliable coverage now but by how
> much?  The basic things are already there in userland.  The tradeoff
> simply doesn't make any sense.  If it were a well separated self
> sustained feature, it probably would be able to get in, but it's all
> over the place and requires a completely new concept - the
> quasi-ABI'ish binary blob which would probably be portable across
> different kernel versions with some massaging.  I personally think the
> idea is fundamentally flawed (just go through the usual ABI!) but even
> if it were not it would require _MUCH_ stronger rationale than it
> currently has to be even considered for mainline inclusion.
> 
> Maybe it's just me but most of the arguments for in-kernel CR look
> very weak.  They're either about remote toy use cases or along the
> line that userland CR currently doesn't do everything kernel CR does
> (yet).  Even if it weren't for me, I frankly can't see how it would be
> included in mainline.
> 
> I think it would be best for everyone to improve userland CR.  A lot
> of knowledge and experience gained through kernel CR would be
> applicable and won't go wasted.  Strong resistance against direction
> change certainly is understandable but IMHO pushing the current
> direction would only increase loss.  I of course could be completely
> wrong and might end up getting mails filled up with megabytes of "told
> you so" later, but, well, at this point, in-kernel CR already looks
> half dead to me.
> 
> Thank you.
> 
> -- 
> tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found]                                     ` <04F4899E-B5C7-4BAF-8F2F-05D507A91408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
@ 2010-11-19 15:33                                       ` Tejun Heo
  2010-11-19 16:00                                         ` Alexey Dobriyan
  2010-11-20 17:58                                         ` Oren Laadan
  0 siblings, 2 replies; 49+ messages in thread
From: Tejun Heo @ 2010-11-19 15:33 UTC (permalink / raw)
  To: Kirill Korotaev
  Cc: Kapil Arya, Pavel Emelianov, Gene Cooperman,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Eric W. Biederman, Linux Containers

Hello,

On 11/19/2010 03:36 PM, Kirill Korotaev wrote:
> Can you imagine how many userland APIs are needed to make userspace C/R?
> 
> Do you really want APIs in user-space which allow to:
> - send signals with siginfo attached (kill() doesn't work...)

Doesn't rt_sigqueueinfo() already do this?

> - read inotify configuration

This would be nice even apart from CR.

> - insert SKB's into socket buffers

Can't we drain kernel buffers?  ie. Stop further writing and wait for
the send-q to drop to zero.
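
A minimal sketch of what "wait for the send-q to drop to zero" could
look like from userland (not from the original thread; the helper names
are invented).  It assumes Linux, where the SIOCOUTQ ioctl on a TCP
socket reports the bytes still queued for sending and has the same
numeric value as TIOCOUTQ, which Python exposes via the termios module.

```python
import fcntl
import socket
import struct
import termios
import time


def unsent_bytes(sock):
    """Bytes still sitting in the socket's send queue (Linux SIOCOUTQ)."""
    raw = fcntl.ioctl(sock.fileno(), termios.TIOCOUTQ, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def wait_for_send_queue_drain(sock, timeout=5.0, poll_interval=0.01):
    """Poll until the send queue is empty; return True if drained in time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if unsent_bytes(sock) == 0:
            return True
        time.sleep(poll_interval)
    return unsent_bytes(sock) == 0


if __name__ == "__main__":
    listener = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    client.sendall(b"x" * 4096)  # the application stops writing after this
    print("drained:", wait_for_send_queue_drain(client))
    for s in (client, server, listener):
        s.close()
```

The timeout matters: if the peer or the network goes away, the queue
never empties, and the checkpointer has to decide what to do with the
data that is still stuck in the kernel.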

> - setup all TCP/IP parameters for sockets

I _think_ most can be restored by talking to netfilter module.
Setting outgoing sequence number might be beneficial tho.

> - wait for AIO pending in other processes

I haven't looked at aio implementation for a while now but can't we
drain these upon checkpointing and just carry the completion status?
Also, if aio is what you're concerned about, I would say the problem
is mostly solved.

> - setting different statistics counters (like netdev stats etc.)
> and so on...

Why would this matter?

> For every small piece of functionality you will need to export ABI
> and maintain it forever.  It's thousands of APIs! And why the hell
> are they needed in user space at all?

I think it's actually quite the contrary.  Most things are already
visible to userland.  They _have_ to be and that's the reason why
userland implementation can already get most things working without
any change to the kernel with some amount of hackery.  To me in-kernel
CR seems to approach the problem from the exactly wrong direction -
rather than dealing with specific exceptions, it creates a completely
new framework which is very foreign and not useful outside of CR.

Also, think about it.  Which one is better?  A kernel which can fully
show its ABI visible states to userland or one which dumps its
internal data structures in binary blobs.  To me, the latter seems
multiple orders of magnitude uglier.

> BTW, HPC case you are talking about is probably the simplest
> one.

Yet, it is one of the most important / relevant use cases.

> Last time I looked into it, IBM Meiosis c/r didn't even bother with
> tty's migration.  In OpenVZ we really do need much more than that
> like autofs/NFS support, preserve statistics, TTYs, etc. etc. etc.

Would it be impossible to preserve autofs/NFS and TTYs from userland?
Then, why so?  For statistics, I'm a bit lost.  Why does it matter and
even if it does would it justify putting the whole CR inside kernel?

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 15:33                                       ` Tejun Heo
@ 2010-11-19 16:00                                         ` Alexey Dobriyan
  2010-11-19 16:01                                           ` Alexey Dobriyan
  2010-11-19 16:06                                           ` Tejun Heo
  2010-11-20 17:58                                         ` Oren Laadan
  1 sibling, 2 replies; 49+ messages in thread
From: Alexey Dobriyan @ 2010-11-19 16:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman,
	Linux Containers

On Fri, Nov 19, 2010 at 5:33 PM, Tejun Heo <tj@kernel.org> wrote:
>> - insert SKB's into socket buffers
>
> Can't we drain kernel buffers?  ie. Stop further writing and wait for
> the send-q to drop to zero.

On send:
if network dies right after freeze, you lose.

On receive:
packets arrive after process freeze, but before network device freeze.

>> - setting different statistics counters (like netdev stats etc.)
>> and so on...
>
> Why would this matter?

Because you'll introduce a million stupid interfaces not interesting to
anyone but C/R.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 16:00                                         ` Alexey Dobriyan
@ 2010-11-19 16:01                                           ` Alexey Dobriyan
  2010-11-19 16:10                                             ` Tejun Heo
  2010-11-19 16:06                                           ` Tejun Heo
  1 sibling, 1 reply; 49+ messages in thread
From: Alexey Dobriyan @ 2010-11-19 16:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman,
	Linux Containers

On Fri, Nov 19, 2010 at 6:00 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
>>> - setting different statistics counters (like netdev stats etc.)
>>> and so on...
>>
>> Why would this matter?
>
> Because you'll introduce a million stupid interfaces not interesting to
> anyone but C/R.

Just like CLONE_SET_PID.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 16:00                                         ` Alexey Dobriyan
  2010-11-19 16:01                                           ` Alexey Dobriyan
@ 2010-11-19 16:06                                           ` Tejun Heo
  2010-11-19 16:16                                             ` Alexey Dobriyan
  1 sibling, 1 reply; 49+ messages in thread
From: Tejun Heo @ 2010-11-19 16:06 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman,
	Linux Containers

Hello,

On 11/19/2010 05:00 PM, Alexey Dobriyan wrote:
> On Fri, Nov 19, 2010 at 5:33 PM, Tejun Heo <tj@kernel.org> wrote:
>>> - insert SKB's into socket buffers
>>
>> Can't we drain kernel buffers?  ie. Stop further writing and wait the
>> send-q to drop to zero.
> 
> On send:
> if network dies right after freeze, you lose.

Gosh, if you're really worried about that, put a netfilter module
which would buffer and simulate acks to extract the packets before
initiating freeze.  These are fringe problems.  Use fringe solutions.

> On receive:
> packets arrive after process freeze, but before network device freeze.

Just store the data somewhere.  The checkpointer can drain the socket,
right?
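
A minimal sketch of the "drain the socket" idea, assuming the checkpointer already holds the socket's file descriptor. The helper name drain_socket() is hypothetical and not taken from any C/R tool. It only saves what is queued at the moment of the call, so data arriving afterwards still needs the network to be quiesced first.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not from any C/R tool): copy whatever is currently
 * queued on fd into buf without blocking.  Returns the number of bytes
 * saved, or -1 on a real error.
 */
ssize_t drain_socket(int fd, char *buf, size_t cap)
{
    size_t total = 0;

    assert(buf != NULL);
    while (total < cap) {
        ssize_t n = recv(fd, buf + total, cap - total, MSG_DONTWAIT);

        if (n > 0) {
            total += (size_t)n;     /* saved another chunk */
        } else if (n == 0) {
            break;                  /* peer closed the connection */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;                  /* receive queue is empty right now */
        } else if (errno != EINTR) {
            return -1;              /* genuine error */
        }
        /* on EINTR the loop simply retries */
    }
    return (ssize_t)total;
}
```

On restart the saved bytes would have to be handed back to the application ahead of any new data. That part is not shown here.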

>>> - setting different statistics counters (like netdev stats etc.)
>>> and so on...
>>
>> Why would this matter?
> 
> Because you'll introduce million stupid interfaces not interesting to
> anyone but C/R.

In this thread, how many have you guys come up with?  Not even a dozen
and most can be solved almost trivially.  Seriously, what the hell..

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 16:01                                           ` Alexey Dobriyan
@ 2010-11-19 16:10                                             ` Tejun Heo
  2010-11-19 16:25                                               ` Alexey Dobriyan
  0 siblings, 1 reply; 49+ messages in thread
From: Tejun Heo @ 2010-11-19 16:10 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman,
	Linux Containers

On 11/19/2010 05:01 PM, Alexey Dobriyan wrote:
> On Fri, Nov 19, 2010 at 6:00 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
>>>> - setting different statistics counters (like netdev stats etc.)
>>>> and so on...
>>>
>>> Why would this matter?
>>
>> Because you'll introduce million stupid interfaces not interesting to
>> anyone but C/R.
> 
> Just like CLONE_SET_PID.

Well, if you ask me, having pidns w/o a way to reinstate PID from
userland is pretty silly and you and I might not know yet but it's
quite imaginable that there will be other use cases for the capability
unlike in-kernel CR.  Kernel provides building blocks not the whole
frigging package and for very good reasons.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 16:06                                           ` Tejun Heo
@ 2010-11-19 16:16                                             ` Alexey Dobriyan
  2010-11-19 16:19                                               ` Tejun Heo
  0 siblings, 1 reply; 49+ messages in thread
From: Alexey Dobriyan @ 2010-11-19 16:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman,
	Linux Containers

On Fri, Nov 19, 2010 at 6:06 PM, Tejun Heo <tj@kernel.org> wrote:
>>>> - setting different statistics counters (like netdev stats etc.)
>>>> and so on...
>>>
>>> Why would this matter?
>>
>> Because you'll introduce million stupid interfaces not interesting to
>> anyone but C/R.
>
> In this thread, how many have you guys come up with?  Not even a dozen
> and most can be solved almost trivially.  Seriously, what the hell..

I do not count them.

The paragon of absurdity is struct task_struct::did_exec .

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 16:16                                             ` Alexey Dobriyan
@ 2010-11-19 16:19                                               ` Tejun Heo
  2010-11-19 16:27                                                 ` Alexey Dobriyan
  0 siblings, 1 reply; 49+ messages in thread
From: Tejun Heo @ 2010-11-19 16:19 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman,
	Linux Containers

On 11/19/2010 05:16 PM, Alexey Dobriyan wrote:
> On Fri, Nov 19, 2010 at 6:06 PM, Tejun Heo <tj@kernel.org> wrote:
>>>>> - setting different statistics counters (like netdev stats etc.)
>>>>> and so on...
>>>>
>>>> Why would this matter?
>>>
>>> Because you'll introduce million stupid interfaces not interesting to
>>> anyone but C/R.
>>
>> In this thread, how many have you guys come up with?  Not even a dozen
>> and most can be solved almost trivially.  Seriously, what the hell..
> 
> I do not count them.
> 
> The paragon of absurdity is struct task_struct::did_exec .

Yeah, then go and figure how to do that in a way which would be useful
for other purposes too instead of trying to shove the whole
checkpointer inside the kernel.  It sure would be harder but hey
that's the way it is.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 16:10                                             ` Tejun Heo
@ 2010-11-19 16:25                                               ` Alexey Dobriyan
  0 siblings, 0 replies; 49+ messages in thread
From: Alexey Dobriyan @ 2010-11-19 16:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman,
	Linux Containers

On Fri, Nov 19, 2010 at 6:10 PM, Tejun Heo <tj@kernel.org> wrote:
> Well, if you ask me, having pidns w/o a way to reinstate PID from
> userland is pretty silly

No.
Chrome uses CLONE_NEWPID so that an exploit can't attach to processes in
the parent pidns.

> and you and I might not know yet but it's
> quite imaginable that there will be other use cases for the capability
> unlike in-kernel CR.  Kernel provides building blocks not the whole
> frigging package and for very good reasons.

Speaking of pids, a pid's value itself is never interesting (except maybe pid 1).
It's a cookie.

CLONE_SET_PID came up only now because only C/R wants it.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 16:19                                               ` Tejun Heo
@ 2010-11-19 16:27                                                 ` Alexey Dobriyan
       [not found]                                                   ` <AANLkTin7kd3crS+fTLLea5PhAii7B3dz=n7p7YtQ6d4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 49+ messages in thread
From: Alexey Dobriyan @ 2010-11-19 16:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman,
	Linux Containers

On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote:
>> The paragon of absurdity is struct task_struct::did_exec .
>
> Yeah, then go and figure how to do that in a way which would be useful
> for other purposes too instead of trying to shove the whole
> checkpointer inside the kernel.  It sure would be harder but hey
> that's the way it is.

System call for one bit? This is ridiculous.
Doing execve(2) for userspace C/R is ridiculous too (and likely doesn't work).

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found]                                                   ` <AANLkTin7kd3crS+fTLLea5PhAii7B3dz=n7p7YtQ6d4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-11-19 16:32                                                     ` Tejun Heo
  2010-11-19 16:38                                                       ` Alexey Dobriyan
  0 siblings, 1 reply; 49+ messages in thread
From: Tejun Heo @ 2010-11-19 16:32 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Kapil Arya, Kirill Korotaev, Pavel Emelianov, Gene Cooperman,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Eric W. Biederman, Linux Containers

On 11/19/2010 05:27 PM, Alexey Dobriyan wrote:
> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>>> The paragon of absurdity is struct task_struct::did_exec .
>>
>> Yeah, then go and figure how to do that in a way which would be useful
>> for other purposes too instead of trying to shove the whole
>> checkpointer inside the kernel.  It sure would be harder but hey
>> that's the way it is.
> 
> System call for one bit? This is ridiculous.

Why not just a flag in proc entry?  It's a frigging single bit.

> Doing execve(2) for userspace C/R is ridiculous too (and likely
> doesn't work).

Really, whatever.  Just keep doing what you're doing.  Hey, if it
makes you happy, it can't be too wrong.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 16:32                                                     ` Tejun Heo
@ 2010-11-19 16:38                                                       ` Alexey Dobriyan
  2010-11-19 16:50                                                         ` Tejun Heo
  0 siblings, 1 reply; 49+ messages in thread
From: Alexey Dobriyan @ 2010-11-19 16:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman,
	Linux Containers

On Fri, Nov 19, 2010 at 6:32 PM, Tejun Heo <tj@kernel.org> wrote:
> On 11/19/2010 05:27 PM, Alexey Dobriyan wrote:
>> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote:
>>>> The paragon of absurdity is struct task_struct::did_exec .
>>>
>>> Yeah, then go and figure how to do that in a way which would be useful
>>> for other purposes too instead of trying to shove the whole
>>> checkpointer inside the kernel.  It sure would be harder but hey
>>> that's the way it is.
>>
>> System call for one bit? This is ridiculous.
>
> Why not just a flag in proc entry?  It's a frigging single bit.

Because /proc/*/did_exec useless to anyone but C/R (even for reading!).

Because code is much simpler:

    tsk->did_exec = !!tsk_img->did_exec;
+
    __u8 did_exec;

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 16:38                                                       ` Alexey Dobriyan
@ 2010-11-19 16:50                                                         ` Tejun Heo
  2010-11-19 16:55                                                           ` Alexey Dobriyan
  0 siblings, 1 reply; 49+ messages in thread
From: Tejun Heo @ 2010-11-19 16:50 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman,
	Linux Containers

On 11/19/2010 05:38 PM, Alexey Dobriyan wrote:
> On Fri, Nov 19, 2010 at 6:32 PM, Tejun Heo <tj@kernel.org> wrote:
>> On 11/19/2010 05:27 PM, Alexey Dobriyan wrote:
>>> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote:
>>>>> The paragon of absurdity is struct task_struct::did_exec .
>>>>
>>>> Yeah, then go and figure how to do that in a way which would be useful
>>>> for other purposes too instead of trying to shove the whole
>>>> checkpointer inside the kernel.  It sure would be harder but hey
>>>> that's the way it is.
>>>
>>> System call for one bit? This is ridiculous.
>>
>> Why not just a flag in proc entry?  It's a frigging single bit.
> 
> Because /proc/*/did_exec useless to anyone but C/R (even for reading!).

I don't think you'll need a full file.  Just shove it in status or
somewhere.  Your argument is completely absurd.  So, because exporting
single bit is so horrible to everyone else, you want to shove the
whole frigging checkpointer inside the kernel?

> Because code is much simpler:
> 
>     tsk->did_exec = !!tsk_img->did_exec;
> +
>     __u8 did_exec;

Sigh, yeah, except for the horror show to create tsk_img.  Your
"paragon of absurdity" is did_exec which is only ever used to decide
whether setpgid() should fail with -EACCES, seriously?  Here's a
thought.  Ignore it for now and concentrate on more relevant problems.
I'm fairly sure CR'd program malfunctioning over did_exec wouldn't
mark the beginning of the end of our civilization.  You gotta be
kidding me.
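
For reference, the userspace-visible effect of that bit can be observed with a small experiment. This is a sketch, assuming Linux and /bin/sh. The function name is hypothetical. It forks a child, uses a close-on-exec pipe to learn when the child's execve() has succeeded, and then reports what setpgid() on that child returns.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the errno of setpgid() on a child that has already exec'd
 * (0 if the call unexpectedly succeeded), or -1 if the experiment itself
 * could not be set up.
 */
int setpgid_errno_after_exec(void)
{
    int pfd[2];
    char byte;
    ssize_t n;
    pid_t pid;
    int result;

    if (pipe(pfd) != 0)
        return -1;
    /* close-on-exec: a successful execve() closes the child's write end */
    if (fcntl(pfd[1], F_SETFD, FD_CLOEXEC) != 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }
    if (pid == 0) {
        close(pfd[0]);
        execl("/bin/sh", "sh", "-c", "exec sleep 5", (char *)NULL);
        n = write(pfd[1], "x", 1);  /* exec failed: tell the parent */
        (void)n;
        _exit(127);
    }

    assert(pid > 0);                /* parent from here on */
    close(pfd[1]);
    do {
        n = read(pfd[0], &byte, 1); /* EOF (0) means the exec happened */
    } while (n < 0 && errno == EINTR);
    close(pfd[0]);

    if (n != 0)
        result = -1;                /* exec failed or read error */
    else if (setpgid(pid, pid) == 0)
        result = 0;
    else
        result = errno;

    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return result;
}
```

On a POSIX-conforming system the expected result is EACCES, which is the setpgid() rule mentioned above.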

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 16:50                                                         ` Tejun Heo
@ 2010-11-19 16:55                                                           ` Alexey Dobriyan
  0 siblings, 0 replies; 49+ messages in thread
From: Alexey Dobriyan @ 2010-11-19 16:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kirill Korotaev, Serge Hallyn, Kapil Arya, Gene Cooperman,
	linux-kernel@vger.kernel.org, Pavel Emelianov, Eric W. Biederman,
	Linux Containers

On Fri, Nov 19, 2010 at 6:50 PM, Tejun Heo <tj@kernel.org> wrote:
> On 11/19/2010 05:38 PM, Alexey Dobriyan wrote:
>> On Fri, Nov 19, 2010 at 6:32 PM, Tejun Heo <tj@kernel.org> wrote:
>>> On 11/19/2010 05:27 PM, Alexey Dobriyan wrote:
>>>> On Fri, Nov 19, 2010 at 6:19 PM, Tejun Heo <tj@kernel.org> wrote:
>>>>>> The paragon of absurdity is struct task_struct::did_exec .
>>>>>
>>>>> Yeah, then go and figure how to do that in a way which would be useful
>>>>> for other purposes too instead of trying to shove the whole
>>>>> checkpointer inside the kernel.  It sure would be harder but hey
>>>>> that's the way it is.
>>>>
>>>> System call for one bit? This is ridiculous.
>>>
>>> Why not just a flag in proc entry?  It's a frigging single bit.
>>
>> Because /proc/*/did_exec useless to anyone but C/R (even for reading!).
>
> I don't think you'll need a full file.  Just shove it in status or
> somewhere.  Your argument is completely absurd.  So, because exporting
> single bit is so horrible to everyone else, you want to shove the
> whole frigging checkpointer inside the kernel?
>
>> Because code is much simpler:
>>
>>     tsk->did_exec = !!tsk_img->did_exec;
>> +
>>     __u8 did_exec;
>
> Sigh, yeah, except for the horror show to create tsk_img.

task_struct image work is common for both userspace C/R and in-kernel.
You _have_ to define it.
Simpler code is only first line.

> Your "paragon of absurdity" is did_exec which is only ever used
> to decide whether setpgid() should fail with -EACCES, seriously?
> Here's a thought.  Ignore it for now and concentrate on more
> relevant problems.

You're so newjerseyly now.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 15:33                                       ` Tejun Heo
  2010-11-19 16:00                                         ` Alexey Dobriyan
@ 2010-11-20 17:58                                         ` Oren Laadan
  1 sibling, 0 replies; 49+ messages in thread
From: Oren Laadan @ 2010-11-20 17:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kirill Korotaev, Kapil Arya, Pavel Emelianov, Gene Cooperman,
	linux-kernel@vger.kernel.org, Eric W. Biederman, Linux Containers

On Fri, 19 Nov 2010, Tejun Heo wrote:

> Hello,
> 
> On 11/19/2010 03:36 PM, Kirill Korotaev wrote:
> > Can you imagine how many userland APIs are needed to make userspace C/R?
> > 
> > Do you really want APIs in user-space which allow to:
> > - send signals with siginfo attached (kill() doesn't work...)
> 
> Doesn't rt_sigqueueinfo() already do this?
> 
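
For context, a minimal sketch of what rt_sigqueueinfo() allows from userspace, assuming Linux. The function name is hypothetical. It queues SIGUSR1 to the calling process with a caller-built siginfo and then collects it synchronously with sigwaitinfo(), so no handler is needed. Sending to another process works the same way, subject to the usual permission checks.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Queue SIGUSR1 to ourselves with a hand-built siginfo carrying `payload`,
 * then report the si_code and value that came back.  Returns 0 or -1.
 */
int queue_signal_with_siginfo(int payload, int *code_out, int *value_out)
{
    sigset_t set, old;
    siginfo_t info, got;
    int rc = -1;

    assert(code_out != NULL && value_out != NULL);

    /* Block SIGUSR1 so it stays pending until we collect it below. */
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    memset(&info, 0, sizeof info);
    info.si_signo = SIGUSR1;
    info.si_code = SI_QUEUE;        /* negative code: user-generated */
    info.si_pid = getpid();
    info.si_uid = getuid();
    info.si_value.sival_int = payload;

    /* glibc has no wrapper for rt_sigqueueinfo, so use syscall(2). */
    if (syscall(SYS_rt_sigqueueinfo, getpid(), SIGUSR1, &info) == 0) {
        int sig;

        do {
            sig = sigwaitinfo(&set, &got);
        } while (sig < 0 && errno == EINTR);
        if (sig == SIGUSR1) {
            *code_out = got.si_code;
            *value_out = got.si_value.sival_int;
            rc = 0;
        }
    }
    sigprocmask(SIG_SETMASK, &old, NULL);
    return rc;
}
```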

You assume that c/r is done by the checkpointed processes _themselves_,
that is, that to checkpoint a process, that process needs to be made
runnable and will save its own state (which is the model of dmtcp, but
not of using ptrace).

This model is restrictive: it requires that you hijack the execution of
that process somehow and make it run. What if the process isn't runnable
(e.g. in vfork waiting for completion, or ptraced deep in the kernel)?
Letting it run even just a bit may modify its state. It also means that
if you have many processes in the checkpointed session, e.g. 1000, then
_all_ of them will have to be scheduled to run!

With kernel c/r this is unnecessary:  you can use an auxiliary process
to checkpoint other processes without scheduling the other processes.
I.e. it's _transparent_ and _preemptive_.
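
The "auxiliary process" idea can be illustrated from userspace in a very limited way. The sketch below is not linux-cr code and uses hypothetical function names. It stops a child with SIGSTOP and reads its scheduler state from /proc/<pid>/stat while the child never runs. /proc exposes only a fraction of the state a real checkpoint needs, so this shows only the shape of the model.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Read the one-letter state of pid from /proc/<pid>/stat.
 * Returns the state character (e.g. 'R', 'S', 'T'), or 0 on failure.
 */
char task_state(pid_t pid)
{
    char path[64], line[1024];
    char *p;
    FILE *f;

    assert(pid > 0);
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return 0;
    if (fgets(line, sizeof line, f) == NULL) {
        fclose(f);
        return 0;
    }
    fclose(f);

    /* Format is "pid (comm) state ..."; comm may contain ')'. */
    p = strrchr(line, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return 0;
    return p[2];
}

/* Stop a child, inspect it from the parent, then clean up. */
char inspect_stopped_child(void)
{
    int status;
    char state = 0;
    pid_t pid = fork();

    if (pid < 0)
        return 0;
    if (pid == 0) {
        for (;;)
            pause();                /* child: idle until killed */
    }
    if (kill(pid, SIGSTOP) == 0 &&
        waitpid(pid, &status, WUNTRACED) == pid && WIFSTOPPED(status))
        state = task_state(pid);    /* the child is not scheduled here */

    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return state;
}
```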

Another advantage is that if anything fails during checkpoint (for 
whatever reason), there are no side-effects (which is not the case with
the other method).

> > For every small piece of functionality you will need to export ABI
> > and maintain it forever.  It's thousands of APIs! And why the hell
> > they are needed in user space at all?
> 
> I think it's actually quite the contrary.  Most things are already
> visible to userland.  They _have_ to be and that's the reason why
> userland implementation can already get most things working without
> any change to the kernel with some amount of hackery.  To me in-kernel
> CR seems to approach the problem from the exactly wrong direction -
> rather than dealing with specific exceptions, it create a completely
> new framework which is very foreign and not useful outside of CR.
> 
> Also, think about it.  Which one is better?  A kernel which can fully
> show its ABI visible states to userland or one which dumps its
> internal data structurs in binary blobs.  To me, the latter seems
> multiple orders of magnitude uglier.

Are we judging aesthetics?  To me the former looks uglier...

The amount of fragile hacks you need to go through to make it work
in userspace for the generic cases (including userspace trickery
and new crazy APIs from the kernel for state that was never even an
ABI, like skb's), and the restrictions it imposes, simply suggest that
userspace is not the right place to do it.

Thanks,

Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-19 14:04                               ` Tejun Heo
@ 2010-11-20 18:05                                 ` Oren Laadan
       [not found]                                 ` <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  1 sibling, 0 replies; 49+ messages in thread
From: Oren Laadan @ 2010-11-20 18:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Serge Hallyn, Kapil Arya, Gene Cooperman, linux-kernel, xemul,
	Eric W. Biederman, Linux Containers

Hi,

Based on discussion with Gene, I'd like to clarify the key points and
differences between the kernel and userspace approaches (specifically
linux-cr and dmtcp). It comes in three parts, to break up the long post...

part I: perspective on the types and scopes of c/r under discussion
part II: linux-cr design and objectives
part III: comparison of kernel/userspace approaches

[now relax, grab (another) cup of coffee and read on...]

PART I:  ==PERSPECTIVE==

A rough classification of c/r categories:

* container-c/r: important use-case, e.g. c/r and migration of an
  application container like a VPS (virtual private server), VDI
  (desktop) or other self-contained application (e.g. Oracle server).
  Here _all_ the relevant processes are included in the checkpoint.

* standalone-c/r: another use-case is standalone-c/r where a set of
  processes is checkpointed, but not the entire environment, and then
  those processes are restarted in a different "eco-system".

* distributed-c/r: meaning several sets of processes, each running
  on a different host. (Each set may be a separate container there).

In container-c/r, the main challenge is to be _reliable_ in the sense
that a restart from a successful checkpoint should always succeed.

In standalone-c/r, the main challenge is that an application resumes
execution after a restart in a possibly _different_ eco-system. Some
applications don't care (e.g. 'bc'). Other applications do care, and to
different degrees; for these we need "glue" to pacify the application.

There are generally three types of "glue":

(1) Modify the application or selected libraries to be c/r-aware, and
  notify it when restart completes. (e.g. CoCheck MPI library).
(2) Add a userspace helper that will run post-restart to do necessary
  trickery (eg. send a SIGWINCH to 'screen'; mount proper filesystem
  at the new host after migration; reconnect a socket to a peer).
(3) Use interposition on selected library calls and add wrapper code
  that will glue in what's missing (e.g. dbus or nscd calls to
  reconnect an application to those services).

IMPORTANT: the glueing method is _orthogonal_ to how the c/r is done!
We are strictly discussing the core c/r functionality.

(next part: linux-cr philosophy...)

Thanks,

Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found]                                 ` <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2010-11-19 14:36                                   ` Kirill Korotaev
@ 2010-11-20 18:08                                   ` Oren Laadan
  2010-11-20 18:11                                   ` Oren Laadan
  2 siblings, 0 replies; 49+ messages in thread
From: Oren Laadan @ 2010-11-20 18:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kapil Arya, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	xemul-3ImXcnM4P+0, Eric W. Biederman, Linux Containers


Hi,

[continuation of posting regarding kernel vs userspace approach]

part I: perspective on the types and scopes of c/r under discussion
part II: linux-cr design and objectives
part III: comparison of kernel/userspace approaches


PART II:  ==PHILOSOPHY==

Linux-cr is a _generic_ c/r-engine with multiple capabilities. It can
checkpoint a full container, a process hierarchy, or a single process.
For containers, it provides guarantees like restart-ability; for the
others, it provides the flexibility so that c/r-aware applications,
libraries, helpers, and wrappers can glue what they wish to glue.

1) Transparent - completely transparent for container-c/r, and largely
  so for standalone-cr ("largely" - as in except for the glue which is
  needed due to loss of eco-system, not due to restarting).
2) Reliable - if checkpoint succeeds then restart is guaranteed
  to succeed too (for container-c/r).
3) Preemptive - works without requiring that checkpointed processes
  be scheduled to run (and thus "collaborate")
4) Complete - covers all visible and hidden state in the kernel
  about processes (even if not directly visible to userspace)
5) Efficient - can be optimized along multiple axes: _zero_ impact on
  runtime, low downtime during checkpoint, partial and incremental
  checkpoint, live-migration, etc.
6) Flexible - can integrate nicely with different userspace "glueing"
  methods.
7) Maintainable - a small part of the code refactors kernel code
  so that it can be reused in restart; the rest is new code that in
  our experience rarely changes. The same holds for the image format.

What linux-cr _does not_ do in the kernel, nor plans to support is:

1) Hardware devices: their state is per-device/vendor. Instead one
   should use virtual devices (VNC for display, pulseaudio for sound,
   screen for ttys), or have a userspace glue to restore the state of
   the device. That said, in the future vendors may opt to provide
   logic for c/r in drivers, e.g. ->checkpoint, ->restart methods.
2) Userspace glue: (as defined for standalone-c/r above) the kernel
   knows about processes and their state, not about their intentions.
   We leave that for userspace.
3) External dependencies: (outside of the local host) the kernel does
   not control what's outside the host. That is the responsibility of
   userspace. (Even with live-migration, the linux-cr only restores
   the local state of the TCP connections).
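
To make the "->checkpoint, ->restart methods" remark in item 1 concrete, here is a purely hypothetical userspace mock of such per-device hooks. None of these names (cr_dev_ops, toy_mixer, cr_migrate_device) exist in linux-cr or in the kernel. They only show the shape such an interface might take.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-device hooks; nothing like this exists in linux-cr. */
struct cr_dev_ops {
    /* Serialize device state into buf; return bytes written or -1. */
    long (*checkpoint)(const void *dev, void *buf, size_t cap);
    /* Rebuild device state from buf; return 0 or -1. */
    int (*restart)(void *dev, const void *buf, size_t len);
};

/* Toy "device": a mixer with a single volume setting. */
struct toy_mixer {
    int volume;
};

static long toy_checkpoint(const void *dev, void *buf, size_t cap)
{
    const struct toy_mixer *m = dev;

    if (cap < sizeof m->volume)
        return -1;
    memcpy(buf, &m->volume, sizeof m->volume);
    return (long)sizeof m->volume;
}

static int toy_restart(void *dev, const void *buf, size_t len)
{
    struct toy_mixer *m = dev;

    if (len != sizeof m->volume)
        return -1;
    memcpy(&m->volume, buf, sizeof m->volume);
    return 0;
}

const struct cr_dev_ops toy_mixer_ops = {
    .checkpoint = toy_checkpoint,
    .restart = toy_restart,
};

/* Engine side: save src through its ops, then restore into dst. */
int cr_migrate_device(const struct cr_dev_ops *ops, const void *src, void *dst)
{
    unsigned char image[64];
    long len;

    assert(ops != NULL && ops->checkpoint != NULL && ops->restart != NULL);
    len = ops->checkpoint(src, image, sizeof image);
    if (len < 0)
        return -1;
    return ops->restart(dst, image, (size_t)len);
}
```

The engine never looks inside the image. Only the driver knows the layout, which is the point of leaving device state to vendors.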

Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found]                                 ` <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2010-11-19 14:36                                   ` Kirill Korotaev
  2010-11-20 18:08                                   ` Oren Laadan
@ 2010-11-20 18:11                                   ` Oren Laadan
       [not found]                                     ` <4CE69B8C.6050606-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2 siblings, 1 reply; 49+ messages in thread
From: Oren Laadan @ 2010-11-20 18:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kapil Arya, Gene Cooperman, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	xemul-3ImXcnM4P+0, Eric W. Biederman, Linux Containers


Hi,

[continuation of discussion of kernel vs userspace c/r approach]
part I: perspective on the types and scopes of c/r under discussion
part II: linux-cr design and objectives
part III: comparison of kernel/userspace approaches


PART III:  ==SOME TECHNICAL ASPECTS==

A few things are important to know about userspace c/r (using DMTCP as
the example) before comparing the kernel and userspace approaches:

DMTCP has two components: 1) a c/r-engine to save/restore process state,
and 2) glue to restart processes out of their original context. They
are _orthogonal_: the glue can be used with other c/r-engines, like
linux-cr. This discussion refers to the c/r-engine _only_.

Focusing on the c/r-engine of DMTCP - it uses syscall interposition
for three reasons:

1) To take control of processes at checkpoint
2) To always track state of resources not visible to userspace
3) To virtualize identifiers after restart

#1 is needed because processes save their own state (and need to run
the checkpoint code for that).

#2 is needed because the kernel does not expose all state, and #3 is
needed because the kernel does not give ways to restore all state. So
these two logics are used to mirror in userspace functionality that
already exists in the kernel.
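
Reason #3 can be pictured as a translation table kept in userspace. The sketch below is not DMTCP's implementation, and every name in it is hypothetical. It only shows the idea: the application keeps seeing the pid it had at checkpoint (the "virtual" pid), while wrappers translate to whatever pid the kernel assigned after restart.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define VPID_MAX 64

/* Hypothetical table mapping application-visible pids to kernel pids. */
struct vpid_map {
    size_t n;
    pid_t virt[VPID_MAX];
    pid_t real[VPID_MAX];
};

/* Add a mapping, or update it if virt is already known.  Returns 0 or -1. */
int vpid_bind(struct vpid_map *m, pid_t virt, pid_t real)
{
    size_t i;

    assert(m != NULL);
    for (i = 0; i < m->n; i++) {
        if (m->virt[i] == virt) {
            m->real[i] = real;      /* e.g. rebinding after restart */
            return 0;
        }
    }
    if (m->n == VPID_MAX)
        return -1;                  /* table full */
    m->virt[m->n] = virt;
    m->real[m->n] = real;
    m->n++;
    return 0;
}

/* Used on the way into the kernel, e.g. inside a kill() wrapper. */
pid_t vpid_to_real(const struct vpid_map *m, pid_t virt)
{
    size_t i;

    assert(m != NULL);
    for (i = 0; i < m->n; i++)
        if (m->virt[i] == virt)
            return m->real[i];
    return (pid_t)-1;
}

/* Used on the way out of the kernel, e.g. inside a getpid() wrapper. */
pid_t vpid_to_virt(const struct vpid_map *m, pid_t real)
{
    size_t i;

    assert(m != NULL);
    for (i = 0; i < m->n; i++)
        if (m->real[i] == real)
            return m->virt[i];
    return (pid_t)-1;
}
```

Every pid-taking or pid-returning call has to go through such a table, which is the runtime cost of mirroring kernel functionality in userspace.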

The main advantages of the approach: (a) portability to other systems
(like BSD), though with considerable effort; (b) it's "good enough" for
several use-cases, without kernel changes.

Putting the c/r-engine in the kernel provides many advantages, which I
summarize in the following table:

category        linux-cr                        userspace
--------------------------------------------------------------------------------
PERFORMANCE     has _zero_ runtime overhead     visible overhead due to syscall
                                                interposition and state tracking
                                                even w/o checkpoints;

OPTIMIZATIONS   many optimizations possible     limited, less effective
                only in kernel, for downtime,   w/ much larger overhead.
                image size, live-migration

OPERATION       applications run unmodified     to do c/r, needs 'controller'
                                                task (launch and manage _entire_
                                                execution) - point of failure.
                                                restricts how a system is used.

PREEMPTIVE      checkpoint at any time, use     processes must be runnable and
                auxiliary task to save state;   "collaborate" for checkpoint;
                non-intrusive: failure does     long task coordination time
                not impact checkpointees.       with many tasks/threads. alters
                                                state of checkpointee if fails.
                                                e.g. cannot checkpoint when in
                                                vfork(), ptrace states, etc.

COVERAGE        save/restore _all_ task state;  needs new ABI for everything:
                identify shared resources; can  expose state, provide means to
                extend for new kernel features  restore state (e.g. TCP protocol
                easily                          options negotiated with peers)

RELIABILITY     checkpoint w/ single syscall;   non-atomic, cannot find leaks
                atomic operation. guaranteed    to determine restartability
                restartability for containers

USERSPACE GLUE  possible                        possible

SECURITY        root and non-root modes         root and non-root modes
                native support for LSM

MAINTENANCE     changes mainly for features     changes mainly for features;
                                                create new ABI for features

I'm not saying Gene's work isn't good - on the contrary, it's a fine
piece of engineering. However, the part of it that does c/r poses many
constraints that limit the generality, mode of use, and performance of
the whole. That may be enough for you, Tejun, for your cluster. But not
for other users of the technology.

And by all means, I intend to cooperate with Gene to see how to
make the other part of DMTCP, namely the userspace "glue", work on
top of linux-cr to have the benefits of all worlds !

All in all, kernel c/r is far more generic and less restrictive than
userspace, can provide nice guarantees, and has superior performance.
It can do everything that a userspace c/r can do, and much more - and
that "much more" is crucial for important use cases.

Last word about maintenance - once the core code is in mainline (which
means a code "spike"), experience (both kernel/userspace) shows that
both code and image format hardly change. The format is tied to the specific
set of features supported (i.e. kernel versions) so that the kernel
does not need to maintain backward compatibility.

Thanks,

Oren

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
       [not found]                                     ` <4CE69B8C.6050606-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-11-20 18:15                                       ` Oren Laadan
  2010-11-20 19:33                                         ` Tejun Heo
  0 siblings, 1 reply; 49+ messages in thread
From: Oren Laadan @ 2010-11-20 18:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul,
	Eric W. Biederman, Linux Containers


[[apologies for the silly prefix on last two posts - a combination
of windows, putty, pine and slow connection is not helping me :( ]]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-20 18:15                                       ` Oren Laadan
@ 2010-11-20 19:33                                         ` Tejun Heo
  2010-11-21  8:18                                           ` Gene Cooperman
  2010-11-22 17:18                                           ` Oren Laadan
  0 siblings, 2 replies; 49+ messages in thread
From: Tejun Heo @ 2010-11-20 19:33 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul,
	Eric W. Biederman, Linux Containers

Hello,

On 11/20/2010 07:15 PM, Oren Laadan wrote:
> 
> [[apologies for the silly prefix on last two posts - a combination
> of windows, putty, pine and slow connection is not helping me :( ]]

Maybe it's a good idea to post a clean concatenated version for later
reference?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-20 19:33                                         ` Tejun Heo
@ 2010-11-21  8:18                                           ` Gene Cooperman
  2010-11-21  8:21                                             ` Gene Cooperman
                                                               ` (2 more replies)
  2010-11-22 17:18                                           ` Oren Laadan
  1 sibling, 3 replies; 49+ messages in thread
From: Gene Cooperman @ 2010-11-21  8:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Oren Laadan, Kapil Arya, Gene Cooperman, linux-kernel, xemul,
	Eric W. Biederman, Linux Containers

In this post, Kapil and I will provide our own summary of how we
see the issues for discussion so far.  In the next post, we'll reply
specifically to comment on Oren's table of comparison between
linux-cr and userspace.

In general, we'd like to add that the conversation with Oren was very
useful for us, and I think Oren will also agree that we were able to
converge on the purely technical questions.

We want to be cautious about offering opinions, since we're
still learning the context of this ongoing discussion on LKML.  There is
probably still some context that we're missing.

Below, we'll summarize the four major questions that we've understood from
this discussion so far.  But before doing so, I want to point out that a single
process or process tree will always have many possible interactions with
the rest of the world.  Within our own group, we have an internal slogan:
  "You can't checkpoint the world."
A virtual machine can have a relatively closed world, which makes it more
robust, but checkpointing will always have some fragile parts.
We give four examples below: 
a.  time virtualization
b.  external database
c.  NSCD daemon
d.  screen and other full-screen text programs
These are not the only examples of difficult interactions with the
rest of the world.

Anyway, in my opinion, the conversation with Oren seemed to converge
into two larger cases:
1.  In a pure userland C/R like DMTCP, how many corner cases are not handled,
	or could not be handled, in a pure userland approach?
	Also, how important are those corner cases?  Do some
	have important use cases that rise above just a corner case?
	[ inotify is one of those examples.  For DMTCP to support this,
	  it would have to put wrappers around inotify_add_watch,
	  inotify_rm_watch, read, etc., and maybe even tracking inodes in case
	  the file had been renamed after the inotify_add_watch.  Something
	  could be made to work for the common cases, but it would
	  still be a hack --- to be done only if a use case demands it. ]
2.  In a Linux C/R approach, it's already recognized that one needs
	a userland component (for example, for convenience of recreating
	the process tree on restart).  How many other cases are there
	that require a userland component?
	[ One example here is the shared memory segment of NSCD, which
	  has to be re-initialized on restart.  Another example is
	  a screen process that talks to an ANSI terminal emulator
	  (e.g. gnome-terminal), which talks to an X server or VNC server.
	  Below, we discuss these examples in more detail. ]
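
[ A minimal sketch of the bookkeeping that the inotify wrappers in
question 1 would need. This is not DMTCP code: the class name, the
method names and the injected add_watch function are hypothetical
stand-ins for wrappers around inotify_add_watch and inotify_rm_watch.
The idea is to record every watch as it is added, hand the application
a stable "virtual" watch descriptor, and replay the records against a
fresh inotify instance on restart. ]

```python
class WatchLog:
    """Records inotify watches so they can be re-created after restart.

    Hypothetical helper: `add_watch` stands in for a wrapper around the real
    inotify_add_watch(fd, path, mask) and must return a watch descriptor.
    """

    def __init__(self, add_watch):
        self._add_watch = add_watch
        self._watches = {}  # virtual wd seen by the app -> (path, mask, real wd)
        self._next_virtual = 1

    def add(self, path, mask):
        real_wd = self._add_watch(path, mask)
        virtual_wd = self._next_virtual
        self._next_virtual += 1
        self._watches[virtual_wd] = (path, mask, real_wd)
        return virtual_wd

    def remove(self, virtual_wd):
        """Forget a watch and return the real descriptor to pass to rm_watch."""
        return self._watches.pop(virtual_wd)[2]

    def real_wd(self, virtual_wd):
        return self._watches[virtual_wd][2]

    def replay(self, add_watch):
        """On restart, re-register every recorded watch on a new instance."""
        self._add_watch = add_watch
        for virtual_wd, (path, mask, _old) in list(self._watches.items()):
            self._watches[virtual_wd] = (path, mask, add_watch(path, mask))
```

[ The sketch replays watches by path only, so it does not cover the
renamed-file case mentioned above; that would need the inode tracking
described in question 1. ]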

One can add a third and fourth question here:

3.  [Originally posed by Oren] Given Linux C/R, how much work would
        it be to add the higher layers of DMTCP on top of Linux C/R?
	[ This is a non-trivial question.  As just one example, DMTCP
	  handles sockets uniformly, regardless of whether they
	  are intra-host or inter-host.  Linux C/R handles certain
	  types of intra-host sockets.  So, merging the two would
	  require some thought. ]
4.  [Originally posed by Tejun, e.g. Fri Nov 19 2010 - 09:04:42 EST]
	Given that DMTCP checkpoints many common applications, how much work
	would it be to add a small number of restricted kernel interfaces
	to enable one to remove some of the hacks in DMTCP, and to cover
	the more important corner cases that DMTCP might be missing?


I'd also like to add some points of my own here.  First, there are certain
cases where I believe that a checkpoint-restart system (in-kernel
or userland or hybrid) can never be completely transparent.  It's because you
can't completely cut the connection with the rest of the world.  In these
examples, I'm thinking primarily of the Linux C/R mode used to checkpoint
a tree of processes.
    To the extent that Linux C/R is used with containers, it seems
to me to be closer to lightweight virtualization.  From there, I've
seen that the conversation goes to comparing lightweight virtualization
versus traditional virtual machines, but that discussion goes beyond my
own personal expertise.

Here are some examples where I believe that every checkpointing system
would suffer from the syndrome of trying to "checkpoint the world".

1.  Time virtualization --- Right now, neither system does time virtualization.
Both systems could do it.  But what is the right policy?
    For example, one process may set a deadline for a task an hour
in the future, and then periodically poll the kernel for the current time
to see if one hour has passed.  This use case seems to require time
virtualization.
    A second process wants to know the current day and time, because a certain
web service updates its information at midnight each day.  This use case
seems to argue that time virtualization is bad.
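
[ A minimal sketch of the first policy only, assuming the application
reads time through a wrapper. The VirtualClock class and its method
names are hypothetical and belong to neither DMTCP nor linux-cr. The
clock hides the interval between checkpoint and restart, so a one-hour
deadline counts only the time the process actually ran. The second
process in the example above would simply keep reading the real
wall-clock time. ]

```python
import time


class VirtualClock:
    """Monotonic clock that hides time spent while checkpointed."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._hidden = 0.0      # total downtime hidden from the application
        self._frozen_at = None  # real time at which the checkpoint was taken

    def time(self):
        # While checkpointed, the clock appears to stand still.
        real = self._frozen_at if self._frozen_at is not None else self._now()
        return real - self._hidden

    def checkpoint(self):
        self._frozen_at = self._now()

    def restart(self):
        self._hidden += self._now() - self._frozen_at
        self._frozen_at = None
```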

2.  External database file on another host --- It's not possible to
checkpoint the remote database file.  In our work with the Condor developers,
they asked us to add a "Condor mode", which says that if there are any
external socket connections, then delay the checkpoint until the external
socket connections are closed.  In a different joint project with CERN (Geneva),
we considered a checkpointing application in which an application
saves much of the database, and then on restart, discovers how much
of its data is stale, and re-loads only the stale portion.
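
[ A rough sketch of the kind of check a "delay until sockets are
closed" policy needs, using only /proc/<pid>/fd on Linux. It is not
the Condor mode implementation, and the function names are made up
for illustration. It also treats every socket as external; telling
intra-host from inter-host connections would take a further lookup of
the socket inode, e.g. in /proc/net/tcp, which is omitted here. ]

```python
import os


def open_sockets(pid="self"):
    """Return the fds of `pid` that refer to sockets, per /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed while we were scanning
        if target.startswith("socket:["):
            found.append(int(name))
    return sorted(found)


def safe_to_checkpoint(pid="self"):
    """Simplified policy: checkpoint only when no socket is open at all."""
    return not open_sockets(pid)
```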

3.  NSCD (Network Services Caching Daemon) --- Glibc arranges for
certain information to be cached in the NSCD.  The information is
in a memory segment shared between the NSCD and the application.
Upon restart, the application doesn't know that the memory segment
is no longer shared with the NSCD, or that the information is stale.
The DMTCP "hack" is to zero out this memory page on restart.  Then glibc
recognizes that it needs to create a new shared memory segment.

4.  screen --- The screen application sets the scrolling region of
its ANSI terminal emulator, in order to create a status line
at the bottom, while scrolling the remaining lines of the terminal.
Upon restart, screen assumes that the scrolling region
has already been set up, and doesn't have to be re-initialized.
So, on restart, DMTCP uses SIGWINCH to fool screen (or any
full-screen text-based application) into believing that its
window size has been changed.  So, screen (or vim, or emacs)
then re-initializes the state of its ANSI terminal, including
scrolling regions and so on.
    So, a userland component is helpful in doing the kind of hacks above.
I recognize that the Linux C/R team agrees that some userland component
can be useful.  I just want to show why some userland hacks will always be
needed.  Let's consider a pure in-kernel approach to checkpointing 'screen'
(or almost any full-screen application that uses a status bar at the bottom).
Screen sets the scrolling region of an ANSI terminal emulator,
which might be a gnome-terminal.  So, a pure in-kernel approach
needs to also checkpoint the gnome-terminal.  But the gnome-terminal
needs to talk to an X server.  So, now one also needs to start
up inside a VNC server to emulate the X server.  So, either
one adds a "hack" in userland to force screen to re-initialize
its ANSI terminal emulator, or else one is forced to include
an entire VNC server just to checkpoint a screen process.
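
[ The SIGWINCH trick can be illustrated with a small self-contained
sketch. This is not DMTCP's restart code: the handler and the
nudge_after_restart name are hypothetical, and the process signals
itself only to keep the example runnable. A real full-screen program
would repaint in its handler, which re-sends the scrolling-region
setup to the terminal. ]

```python
import os
import signal

redraws = []


def on_winch(signum, frame):
    # A full-screen program would query the terminal size and repaint here,
    # which also re-sends its scrolling-region setup to the terminal.
    redraws.append(signum)


signal.signal(signal.SIGWINCH, on_winch)


def nudge_after_restart(pid):
    """Pretend the window was resized so the program re-initializes."""
    os.kill(pid, signal.SIGWINCH)


nudge_after_restart(os.getpid())
```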

Finally, this excerpt below from Tejun's post sums up our views too.  We don't
have the kernel expertise of the people on this list, but we've had
to read some of the kernel code where the documentation was sparse,
and also while teaching O/S.  We would certainly be very happy to work
closely with the kernel developers, if there was interest in extending
DMTCP to directly use more kernel support.

- Gene and Kapil

Tejun Heo wrote Fri Nov 19 2010 - 09:04:42 EST
> What's so wrong with Gene's work? Sure, it has some hacky aspects but
> let's fix those up. To me, it sure looks like much saner and
> manageable approach than in-kernel CR. We can add nested ptrace,
> CLONE_SET_PID (or whatever) in pidns, integrate it with various ns
> supports, add an ability to adjust brk, export inotify state via
> fdinfo and so on.
> 
> The thing is already working, the codebase of core part is fairly
> small and condor is contemplating integrating it, so at least some
> people in HPC segment think it's already viable. Maybe the HPC
> cluster I'm currently sitting near is special case but people here
> really don't run very fancy stuff. In most cases, they're fairly
> simple (from system POV) C programs reading/writing data and burning a
> _LOT_ of CPU cycles inbetween and admins here seem to think dmtcp
> integrated with condor would work well enough for them.
> 
> Sure, in-kernel CR has better or more reliable coverage now but by how
> much? The basic things are already there in userland.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-21  8:18                                           ` Gene Cooperman
@ 2010-11-21  8:21                                             ` Gene Cooperman
  2010-11-22 18:02                                               ` Sukadev Bhattiprolu
  2010-11-23 17:53                                               ` Oren Laadan
  2010-11-21 22:41                                             ` Grant Likely
  2010-11-22 17:34                                             ` Oren Laadan
  2 siblings, 2 replies; 49+ messages in thread
From: Gene Cooperman @ 2010-11-21  8:21 UTC (permalink / raw)
  To: Gene Cooperman
  Cc: Tejun Heo, Oren Laadan, Kapil Arya, linux-kernel, xemul,
	Eric W. Biederman, Linux Containers

As Kapil and I wrote before, we benefited greatly from having talked with Oren,
and learning some more about the context of the discussion.  We were able
to understand better the good technical points that Oren was making.
    Since the comparison table below concerns DMTCP, we'd like to
state some additional technical points that could affect the conclusions.

> category        linux-cr                        userspace
> --------------------------------------------------------------------------------
> PERFORMANCE     has _zero_ runtime overhead     visible overhead due to syscalls
>                                                 interposition and state tracking
>                                                 even w/o checkpoints;

In our experiments so far, the overhead of system calls has been
unmeasurable.  We never wrap read() or write(), in order to keep overhead low.
We also never wrap pthread synchronization primitives such as locks,
for the same reason.  The other system calls are used much less often, and so
the overhead has been too small to measure in our experiments.

> OPTIMIZATIONS   many optimizations possible     limited, less effective
>                 only in kernel, for downtime,   w/ much larger overhead.
>                 image size, live-migration
 
As above, we believe that the overhead while running is negligible.  I'm
assuming that image size refers to in-kernel advantages for incremental
checkpointing.  This is useful for apps where the modified pages tend
not to dominate.  We agree with this point.  As an orthogonal point,
by default DMTCP compresses all checkpoint images using gzip on the fly.
This is useful even when most pages are modified between checkpoints.
Still, as Oren writes, Linux C/R could also add a userland component
to compress checkpoint images on the fly.
    Next, live migration is a question that we simply haven't thought much
about.  If it's important, we could think about what userland approaches might
exist, but we have no near-term plans to tackle live migration.
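
[ As an illustration of compressing an image on the fly, here is a
minimal sketch using Python's standard gzip module. It is not DMTCP's
checkpoint writer; the function names and the 4096-byte page size are
assumptions. Pages are compressed as they are streamed out, so the
uncompressed image never has to be held in one piece. ]

```python
import gzip
import io


def write_image(pages, out):
    """Stream memory pages into `out`, gzip-compressing as they are written."""
    with gzip.GzipFile(fileobj=out, mode="wb") as gz:
        for page in pages:
            gz.write(page)


def read_image(src, page_size=4096):
    """Yield the pages back from a compressed image."""
    with gzip.GzipFile(fileobj=src, mode="rb") as gz:
        while True:
            page = gz.read(page_size)
            if not page:
                break
            yield page
```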

> OPERATION       applications run unmodified     to do c/r, needs 'controller'
>                                                 task (launch and manage _entire_
>                                                 execution) - point of failure.
>                                                 restricts how a system is used.

We'd like to clarify what may be some misconceptions.  The DMTCP
controller does not launch or manage any tasks.  The DMTCP controller
is stateless, and is only there to provide a barrier, namespace server,
and single point of contact to relay ckpt/restart commands.  Recall that
the DMTCP controller handles processes across hosts --- not just on a
single host.
    Also, in any computation involving multiple processes, _every_ process
of the computation is a point of failure.  If any process of the computation
dies, then the simple application strategy is to give up and revert to an
earlier checkpoint.  There are techniques by which an app or DMTCP can
recreate certain failed processes.  DMTCP doesn't currently recreate
a dead controller (no demand for it), but it's not hard to do technically.
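
[ To illustrate the barrier role described above, here is a toy sketch
in which threads stand in for the processes of a computation. It is
not the DMTCP protocol: the phase names are invented, and a real
coordinator would provide the barrier over sockets across hosts. The
point is only that the barrier keeps no record of what each
participant is doing; it just counts arrivals. ]

```python
import threading


def run_phases(n_workers, phases=("suspend", "write", "resume")):
    """Run every worker through each phase in lock step; return the event log."""
    barrier = threading.Barrier(n_workers)
    log, log_lock = [], threading.Lock()

    def worker(rank):
        for phase in phases:
            with log_lock:
                log.append((phase, rank))
            barrier.wait()  # nobody starts the next phase until all arrive

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```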

> PREEMPTIVE      checkpoint at any time, use     processes must be runnable and
>                 auxiliary task to save state;   "collaborate" for checkpoint;
>                 non-intrusive: failure does     long task coordination time
>                 not impact checkpointees.       with many tasks/threads. alters
>                                                 state of checkpointee if fails.
>                                                 e.g. cannot checkpoint when in
>                                                 vfork(), ptrace states, etc.

Our current support of vfork and ptrace has some of the issues that Oren points
out.  One example occurs if a process is in the kernel, and a ptrace state has
changed.  If it was important for some application, we would either have
to think of some "hack", or follow Tejun's alternative suggestion to work
with the developers to add further kernel support.  The kernel developers
on this list can estimate the difficulties of kernel support better than I can.
 
> COVERAGE        save/restore _all_ task state;  needs new ABI for everything:
>                 identify shared resources; can  expose state, provide means to
>                 extend for new kernel features  restore state (e.g. TCP protocol
>                 easily                          options negotiated with peers)

Currently, the only kernel support used by DMTCP is system calls (wrappers),
/proc/*/fd, /proc/*/maps, /proc/*/cmdline, /proc/*/exe, /proc/*/stat.  (I think
I've named them all now.)  The kernel developers will know better
than us what other kernel state one might want to support for C/R, and what
types of applications would need that.
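
[ For readers unfamiliar with these /proc files, a short sketch of
reading /proc/<pid>/maps follows. It is not taken from DMTCP; the
function names are hypothetical. Each line of the file gives an
address range, permissions, file offset, device, inode and an optional
path, which is enough for a userland checkpointer to find the memory
regions it has to save. ]

```python
def parse_maps_line(line):
    """Split one /proc/<pid>/maps line into its fields."""
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": fields[1],
        "offset": int(fields[2], 16),
        # Anonymous mappings have no sixth field.
        "path": fields[5].strip() if len(fields) > 5 else "",
    }


def writable_regions(pid="self"):
    """Return the writable mappings of `pid` (Linux only)."""
    with open(f"/proc/{pid}/maps") as maps:
        return [r for r in map(parse_maps_line, maps) if "w" in r["perms"]]
```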

> RELIABILITY     checkpoint w/ single syscall;   non-atomic, cannot find leaks
>                 atomic operation. guaranteed    to determine restartability
>                 restartability for containers

My understanding is that the guarantees apply for Linux containers, but not
for a tree of processes.  Does this imply that linux-cr would have some
of the same reliability issues as DMTCP for a tree of processes?  (I mean
the question sincerely, and am not intending to be rude.)  In any case,
won't DMTCP and Linux C/R have to handle orthogonal reliability issues
such as external database, time virtualization, and other examples
from our previous post?

> USERSPACE GLUE  possible                        possible
> 
> SECURITY        root and non-root modes         root and non-root modes
>                 native support for LSM
> 
> MAINTENANCE     changes mainly for features     changes mainly for features;
>                                                 create new ABI for features

> And by all means, I intend to cooperate with Gene to see how to
> make the other part of DMTCP, namely the userspace "glue", work on
> top of linux-cr to have the benefits of all worlds !

This is true, and we strongly welcome the cooperation.  We don't know how
this experiment will turn out, but the only way to find out is to sincerely
try it.  Whether we succeed or fail, we will learn something either way!

- Gene and Kapil

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-21  8:18                                           ` Gene Cooperman
  2010-11-21  8:21                                             ` Gene Cooperman
@ 2010-11-21 22:41                                             ` Grant Likely
  2010-11-22 17:34                                             ` Oren Laadan
  2 siblings, 0 replies; 49+ messages in thread
From: Grant Likely @ 2010-11-21 22:41 UTC (permalink / raw)
  To: Gene Cooperman
  Cc: Tejun Heo, Oren Laadan, Kapil Arya, linux-kernel, xemul,
	Eric W. Biederman, Linux Containers

On Sun, Nov 21, 2010 at 03:18:53AM -0500, Gene Cooperman wrote:
> In this post, Kapil and I will provide our own summary of how we
> see the issues for discussion so far.  In the next post, we'll reply
> specifically to comment on Oren's table of comparison between
> linux-cr and userspace.
> 
> In general, we'd like to add that the conversation with Oren was very
> useful for us, and I think Oren will also agree that we were able to
> converge on the purely technical questions.

Hi Gene,

Thanks for the good summary, it helps.  Some random comments below...

> 
> We want to be cautious about offering opinions, since we're
> still learning the context of this ongoing discussion on LKML.  There is
> probably still some context that we're missing.
> 
> Below, we'll summarize the four major questions that we've understood from
> this discussion so far.  But before doing so, I want to point out that a single
> process or process tree will always have many possible interactions with
> the rest of the world.  Within our own group, we have an internal slogan:
>   "You can't checkpoint the world."
> A virtual machine can have a relatively closed world, which makes it more
> robust, but checkpointing will always have some fragile parts.
> We give four examples below: 
> a.  time virtualization
> b.  external database
> c.  NSCD daemon
> d.  screen and other full-screen text programs
> These are not the only examples of difficult interactions with the
> rest of the world.
> 
> Anyway, in my opinion, the conversation with Oren seemed to converge
> into two larger cases:
> 1.  In a pure userland C/R like DMTCP, how many corner cases are not handled,
> 	or could not be handled, in a pure userland approach?
> 	Also, how important are those corner cases?  Do some
> 	have important use cases that rise above just a corner case?
> 	[ inotify is one of those examples.  For DMTCP to support this,
> 	  it would have to put wrappers around inotify_add_watch,
> 	  inotify_rm_watch, read, etc., and maybe even tracking inodes in case
> 	  the file had been renamed after the inotify_add_watch.  Something
> 	  could be made to work for the common cases, but it would
> 	  still be a hack --- to be done only if a use case demands it. ]
> 2.  In a Linux C/R approach, it's already recognized that one needs
> 	a userland component (for example, for convenience of recreating
> 	the process tree on restart).  How many other cases are there
> 	that require a userland component?
> 	[ One example here is the shared memory segment of NSCD, which
> 	  has to be re-initialized on restart.  Another example is
> 	  a screen process that talks to an ANSI terminal emulator
> 	  (e.g. gnome-terminal), which talks to an X server or VNC server.
> 	  Below, we discuss these examples in more detail. ]
> 
> One can add a third and fourth question here:
> 
> 3.  [Originally posed by Oren] Given Linux C/R, how much work would
>         it be to add the higher layers of DMTCP on top of Linux C/R?
> 	[ This is a non-trivial question.  As just one example, DMTCP
> 	  handles sockets uniformly, regardless of whether they
> 	  are intra-host or inter-host.  Linux C/R handles certain
> 	  types of intra-host sockets.  So, merging the two would
> 	  require some thought. ]
> 4.  [Originally posed by Tejun, e.g. Fri Nov 19 2010 - 09:04:42 EST]
> 	Given that DMTCP checkpoints many common applications, how much work
> 	would it be to add a small number of restricted kernel interfaces
> 	to enable one to remove some of the hacks in DMTCP, and to cover
> 	the more important corner cases that DMTCP might be missing?
> 
> 
> I'd also like to add some points of my own here.  First, there are certain
> cases where I believe that a checkpoint-restart system (in-kernel
> or userland or hybrid) can never be completely transparent.  It's because you
> can't completely cut the connection with the rest of the world.  In these
> examples, I'm thinking primarily of the Linux C/R mode used to checkpoint
> a tree of processes.
>     To the extent that Linux C/R is used with containers, it seems
> to me to be closer to lightweight virtualization.  From there, I've
> seen that the conversation goes to comparing lightweight virtualization
> versus traditional virtual machines, but that discussion goes beyond my
> own personal expertise.

At the risk of restating already applied arguments, and as a c/r
outsider, this touches on the real crux of the issue for me.  What is
the complete set of boundaries between a c/r group of processes and
the outside world?  Is it bounded and is it understandable by mere
kernel engineers?  Does it change the assumptions about what a Linux
process /is/, and how to handle it?  How much?  The broad strokes seem
to be straightforward, but as already pointed out, the devil is in
the details.

> Here are some examples where I believe that every checkpointing system
> would suffer from the syndrome of trying to "checkpoint the world".
> 
> 1.  Time virtualization --- Right now, neither system does time virtualization.
> Both systems could do it.  But what is the right policy?
>     For example, one process may set a deadline for a task an hour
> in the future, and then periodically poll the kernel for the current time
> to see if one hour has passed.  This use case seems to require time
> virtualization.
>     A second process wants to know the current day and time, because a certain
> web service updates its information at midnight each day.  This use case
> seems to argue that time virtualization is bad.

Temporal issues need to be (are being?) addressed regardless.  In
certain respects, I'm sure c/r can be seen as a *really long*
scheduler latency, and would have the same effect as a system going
into suspend, or a vm-level checkpoint.  I would think the same
behaviour would be desirable in all cases, including c/r.

> 2.  External database file on another host --- It's not possible to
> checkpoint the remote database file.  In our work with the Condor developers,
> they asked us to add a "Condor mode", which says that if there are any
> external socket connections, then delay the checkpoint until the external
> socket connections are closed.  In a different joint project with CERN (Geneva),
> we considered a checkpointing application in which an application
> saves much of the database, and then on restart, discovers how much
> of its data is stale, and re-loads only the stale portion.
> 
> 3.  NSCD (Network Services Caching Daemon) --- Glibc arranges for
> certain information to be cached in the NSCD.  The information is
> in a memory segment shared between the NSCD and the application.
> Upon restart, the application doesn't know that the memory segment
> is no longer shared with the NSCD, or that the information is stale.
> The DMTCP "hack" is to zero out this memory page on restart.  Then glibc
> recognizes that it needs to create a new shared memory segment.

Right here is exactly the example of a boundary that needs explicit
rules.  When a pair of processes have a shared region, and only one of
them is checkpointed, then what is the behaviour on restore?  In this
specific example, a context-specific hack is used to achieve the
desired result, but that doesn't work (as I believe you agree) in the
general case.  What behaviour will in-kernel support need to enforce?

> 4.  screen --- The screen application sets the scrolling region of
> its ANSI terminal emulator, in order to create a status line
> at the bottom, while scrolling the remaining lines of the terminal.
> Upon restart, screen assumes that the scrolling region
> has already been set up, and doesn't have to be re-initialized.
> So, on restart, DMTCP uses SIGWINCH to fool screen (or any
> full-screen text-based application) into believing that its
> window size has been changed.  So, screen (or vim, or emacs)
> then re-initializes the state of its ANSI terminal, including
> scrolling regions and so on.
>     So, a userland component is helpful in doing the kind of hacks above.
> I recognize that the Linux C/R team agrees that some userland component
> can be useful.  I just want to show why some userland hacks will always be
> needed.  Let's consider a pure in-kernel approach to checkpointing 'screen'
> (or almost any full-screen application that uses a status bar at the bottom).
> Screen sets the scrolling region of an ANSI terminal emulator,
> which might be a gnome-terminal.  So, a pure in-kernel approach
> needs to also checkpoint the gnome-terminal.  But the gnome-terminal
> needs to talk to an X server.  So, now one also needs to start
> up inside a VNC server to emulate the X server.  So, either
> one adds a "hack" in userland to force screen to re-initialize
> its ANSI terminal emulator, or else one is forced to include
> an entire VNC server just to checkpoint a screen process.
> 
> Finally, this excerpt below from Tejun's post sums up our views too.  We don't
> have the kernel expertise of the people on this list, but we've had
> to read some of the kernel code where the documentation was sparse,
> and also while teaching O/S.  We would certainly be very happy to work
> closely with the kernel developers, if there was interest in extending
> DMTCP to directly use more kernel support.
> 
> - Gene and Kapil
> 
> Tejun Heo wrote Fri Nov 19 2010 - 09:04:42 EST
> > What's so wrong with Gene's work? Sure, it has some hacky aspects but
> > let's fix those up. To me, it sure looks like much saner and
> > manageable approach than in-kernel CR. We can add nested ptrace,
> > CLONE_SET_PID (or whatever) in pidns, integrate it with various ns
> > supports, add an ability to adjust brk, export inotify state via
> > fdinfo and so on.
> > 
> > The thing is already working, the codebase of core part is fairly
> > small and condor is contemplating integrating it, so at least some
> > people in HPC segment think it's already viable. Maybe the HPC
> > cluster I'm currently sitting near is special case but people here
> > really don't run very fancy stuff. In most cases, they're fairly
> > simple (from system POV) C programs reading/writing data and burning a
> > _LOT_ of CPU cycles inbetween and admins here seem to think dmtcp
> > integrated with condor would work well enough for them.
> > 
> > Sure, in-kernel CR has better or more reliable coverage now but by how
> > much? The basic things are already there in userland.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-20 19:33                                         ` Tejun Heo
  2010-11-21  8:18                                           ` Gene Cooperman
@ 2010-11-22 17:18                                           ` Oren Laadan
  1 sibling, 0 replies; 49+ messages in thread
From: Oren Laadan @ 2010-11-22 17:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kapil Arya, Gene Cooperman, linux-kernel, xemul,
	Eric W. Biederman, Linux Containers

On Sat, 20 Nov 2010, Tejun Heo wrote:

> Hello,
> 
> On 11/20/2010 07:15 PM, Oren Laadan wrote:
> > 
> > [[apologies for the silly prefix on last two posts - a combination
> > of windows, putty, pine and slow connection is not helping me :( ]]
> 
> Maybe it's a good idea to post a clean concatenated version for later
> reference?
> 

Sure, as soon as I am back on a sane connection (~1 week)
(I cut it in three to make it easier for people to digest ...)

Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-21  8:18                                           ` Gene Cooperman
  2010-11-21  8:21                                             ` Gene Cooperman
  2010-11-21 22:41                                             ` Grant Likely
@ 2010-11-22 17:34                                             ` Oren Laadan
  2 siblings, 0 replies; 49+ messages in thread
From: Oren Laadan @ 2010-11-22 17:34 UTC (permalink / raw)
  To: Gene Cooperman
  Cc: Tejun Heo, Kapil Arya, linux-kernel, xemul, Eric W. Biederman,
	Linux Containers

On Sun, 21 Nov 2010, Gene Cooperman wrote:

> Below, we'll summarize the four major questions that we've understood from
> this discussion so far.  But before doing so, I want to point out that a single
> process or process tree will always have many possible interactions with
> the rest of the world.  Within our own group, we have an internal slogan:
>   "You can't checkpoint the world."
> A virtual machine can have a relatively closed world, which makes it more
> robust, but checkpointing will always have some fragile parts.

That depends on your definition of "world". One definition
is "world := VM", as you state above. Another is "world := container",
which I stated in my post(s). You can checkpoint both.

For those cases where the "world" cannot be fully checkpointed,
I explicitly pointed out that we should focus on the core c/r
functionality, because the "glue" can be done either way.

> We give four examples below: 
> a.  time virtualization

IMHO, irrelevant to the current discussion. And btw, this is done in
linux-cr for live migration of tcp connections.

> b.  external database
> c.  NSCD daemon

This falls within the category of "glue", and is - as I try once
again to remind - entirely orthogonal to the topic of where
to do c/r.

> d.  screen and other full-screen text programs
> These are not the only examples of difficult interactions with the
> rest of the world.

This actually never required a userspace "component" with Zap
or linux-cr (to the best of my knowledge).

Even if it did - the question is not how to deal with "glue"
(you demonstrated quite well how to do that with DMTCP), but
how should the basic, core c/r functionality work - which is
below, and orthogonal to the "glue".

Let us please focus on the base c/r engine functionality...

(gotta disconnect now .. more later)

Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-21  8:21                                             ` Gene Cooperman
@ 2010-11-22 18:02                                               ` Sukadev Bhattiprolu
  2010-11-23 17:53                                               ` Oren Laadan
  1 sibling, 0 replies; 49+ messages in thread
From: Sukadev Bhattiprolu @ 2010-11-22 18:02 UTC (permalink / raw)
  To: Gene Cooperman
  Cc: Kapil Arya, linux-kernel, xemul, Linux Containers,
	Eric W. Biederman, Tejun Heo

Gene Cooperman [gene@ccs.neu.edu] wrote:
| > RELIABILITY     checkpoint w/ single syscall;   non-atomic, cannot find leaks
| >                 atomic operation. guaranteed    to determine restartability
| >                 restartability for containers
| 
| My understanding is that the guarantees apply for Linux containers, but not
| for a tree of processes.  Does this imply that linux-cr would have some
| of the same reliability issues as DMTCP for a tree of processes?  (I mean
| the question sincerely, and am not intending to be rude.)  In any case,
| won't DMTCP and Linux C/R have to handle orthogonal reliability issues
| such as external database, time virtualization, and other examples
| from our previous post?

Yes, if the user attempts to checkpoint a partial container (what we refer
to as a process subtree) or fails to snapshot/restore the filesystem, there
could be leaks that we cannot detect.

But one guarantee we are trying to provide is that if the user checkpoints
a _complete_ container, then we will detect a leak if one exists.

Is there a way to establish a set of constraints (eg: run application in a
container, snapshot/restore filesystem) and then provide leak detection with
a pure userspace implementation ?

Sukadev

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-21  8:21                                             ` Gene Cooperman
  2010-11-22 18:02                                               ` Sukadev Bhattiprolu
@ 2010-11-23 17:53                                               ` Oren Laadan
  2010-11-24  3:50                                                 ` Kapil Arya
  1 sibling, 1 reply; 49+ messages in thread
From: Oren Laadan @ 2010-11-23 17:53 UTC (permalink / raw)
  To: Gene Cooperman
  Cc: Tejun Heo, Kapil Arya, linux-kernel, xemul, Eric W. Biederman,
	Linux Containers

On Sun, 21 Nov 2010, Gene Cooperman wrote:

> As Kapil and I wrote before, we benefited greatly from having talked with Oren,
> and learning some more about the context of the discussion.  We were able
> to understand better the good technical points that Oren was making.
>     Since the comparison table below concerns DMTCP, we'd like to
> state some additional technical points that could affect the conclusions.
> 
> > category        linux-cr                        userspace
> > --------------------------------------------------------------------------------
> > PERFORMANCE     has _zero_ runtime overhead     visible overhead due to syscalls
> >                                                 interposition and state tracking
> >                                                 even w/o checkpoints;
> 
> In our experiments so far, the overhead of system calls has been
> unmeasurable.  We never wrap read() or write(), in order to keep overhead low.
> We also never wrap pthread synchronization primitives such as locks,
> for the same reason.  The other system calls are used much less often, and so
> the overhead has been too small to measure in our experiments.

Syscall interception will have a visible effect on applications that
use those syscalls. You may not observe overhead with HPC ones,
but do you have numbers on server apps ?  apps that use fork/clone
and pipes extensively ?  thread benchmarks, etc ?  compare that
to the absolute zero overhead of linux-cr.

> 
> > OPTIMIZATIONS   many optimizations possible     limited, less effective
> >                 only in kernel, for downtime,   w/ much larger overhead.
> >                 image size, live-migration
>  
> As above, we believe that the overhead while running is negligible.  I'm

For the HPC apps that you use.

> assuming that image size refers to in-kernel advantages for incremental
> checkpointing.  This is useful for apps where the modified pages tend
> not to dominate.  We agree with this point.  As an orthogonal point,
> by default DMTCP compresses all checkpoint images using gzip on the fly.
> This is useful even when most pages are modified between checkpoints.
> Still, as Oren writes, Linux C/R could also add a userland component
> to compress checkpoint images on the fly.

This is not a "userland component", it's "checkpoint | gzip > image.out"...
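
As a rough illustration of the pipeline above, the same on-the-fly
compression can be sketched in Python. The `compress_stream` helper and the
in-memory stand-in for the checkpoint output are assumptions made for
illustration only; they are not part of linux-cr or DMTCP.

```python
import gzip
import io
import shutil


def compress_stream(src, dst, chunk_size=64 * 1024):
    """Copy bytes from src to dst, gzip-compressing them on the fly.

    Similar in spirit to `producer | gzip > image.out`: the data is
    processed one chunk at a time and never held in memory as a whole.
    """
    with gzip.GzipFile(fileobj=dst, mode="wb") as gz:
        shutil.copyfileobj(src, gz, chunk_size)


# Hypothetical stand-in for a checkpoint image stream.
image = io.BytesIO(b"\0" * 1_000_000)
compressed = io.BytesIO()
compress_stream(image, compressed)
```

Because the compressor only sees a byte stream, it needs no knowledge of
how the image was produced, which is the point being made above.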

>     Next, live migration is a question that we simply haven't thought much
> about.  If it's important, we could think about what userland approaches might
> exist, but we have no near-term plans to tackle live migration.

As it is, live-migration _is_ a very important use case.

> 
> > OPERATION       applications run unmodified     to do c/r, needs 'controller'
> >                                                 task (launch and manage _entire_
> >                                                 execution) - point of failure.
> >                                                 restricts how a system is used.
> 
> We'd like to clarify what may be some misconceptions.  The DMTCP
> controller does not launch or manage any tasks.  The DMTCP controller
> is stateless, and is only there to provide a barrier, namespace server,
> and single point of contact to relay ckpt/restart commands.  Recall that
> the DMTCP controller handles processes across hosts --- not just on a
> single host.

The controller is another point of failure. I already pointed out that
the (controlled) application crashes when your controller dies, and
you mentioned it's a bug that should be fixed. But then there will always
be a risk of another, and another ...   You also mentioned that if the
controller dies, then the app should continue to run, but will not be
checkpointable anymore (IIUC).

The point is that the controller is another point of failure, and makes
the execution/checkpoint intrusive. It also adds security and
user-management issues, as you'll need one (or more ?) controller per user
(right now, it's one for all, no ?), and so on.

Plus, because the restarted apps get their virtualized IDs from the
controller, they now can't "see" existing/new processes that
may get the "same" pids (virtualization is not in the kernel).


>     Also, in any computation involving multiple processes, _every_ process
> of the computation is a point of failure.  If any process of the computation
> dies, then the simple application strategy is to give up and revert to an
> earlier checkpoint.  There are techniques by which an app or DMTCP can
> recreate certain failed processes.  DMTCP doesn't currently recreate
> a dead controller (no demand for it), but it's not hard to do technically.

The point is that you _add_ a point of failure: you make the "checkpoint"
operation a possible reason for the application to crash. In contrast, in
linux-cr the checkpoint is idempotent - harmless because it does not
make the applications execute. Instead, it merely observes their state.


> > PREEMPTIVE      checkpoint at any time, use     processes must be runnable and
> >                 auxiliary task to save state;   "collaborate" for checkpoint;
> >                 non-intrusive: failure does     long task coordination time
> >                 not impact checkpointees.       with many tasks/threads. alters
> >                                                 state of checkpointee if fails.
> >                                                 e.g. cannot checkpoint when in
> >                                                 vfork(), ptrace states, etc.
> 
> Our current support of vfork and ptrace has some of the issues that Oren points
> out.  One example occurs if a process is in the kernel, and a ptrace state has
> changed.  If it was important for some application, we would either have
> to think of some "hack", or follow Tejun's alternative suggestion to work
> with the developers to add further kernel support.  The kernel developers
> on this list can estimate the difficulties of kernel support better than I can.
>  
> > COVERAGE        save/restore _all_ task state;  needs new ABI for everything:
> >                 identify shared resources; can  expose state, provide means to
> >                 extend for new kernel features  restore state (e.g. TCP protocol
> >                 easily                          options negotiated with peers)
> 
> Currently, the only kernel support used by DMTCP is system calls (wrappers),
> /proc/*/fd, /proc/*/maps, /proc/*/cmdline, /proc/*/exe, /proc/*/stat.  (I think
> I've named them all now.)  The kernel developers will know better
> than us what other kernel state one might want to support for C/R, and what
> types of applications would need that.
> 
> > RELIABILITY     checkpoint w/ single syscall;   non-atomic, cannot find leaks
> >                 atomic operation. guaranteed    to determine restartability
> >                 restartability for containers
> 
> My understanding is that the guarantees apply for Linux containers, but not
> for a tree of processes.  Does this imply that linux-cr would have some
> of the same reliability issues as DMTCP for a tree of processes?  (I mean
> the question sincerely, and am not intending to be rude.)  In any case,
> won't DMTCP and Linux C/R have to handle orthogonal reliability issues
> such as external database, time virtualization, and other examples
> from our previous post?

There are two points in the claim above:

1) linux-cr can checkpoint with a single syscall - it's atomic. This
gives you more guarantees about the consistency of the checkpointed
application(s), and fewer "opportunities" for the operation as a whole to
fail.

2) restartability - for full-container checkpoint only.

There is no "reliability" issue with c/r of non-containers - it's a matter
of definition: it depends on what your requirements are for the userspace
application and what sort of "glue" you have for it.

And I request again - let's leave out the questions of "time
virtualization" and "external databases" - how are they different for the
VM virtualization solution ?  they are completely orthogonal to the
question we are debating.

Thanks,

Oren.

>
> > USERSPACE GLUE  possible                        possible
> > 
> > SECURITY        root and non-root modes         root and non-root modes
> >                 native support for LSM
> > 
> > MAINTENANCE     changes mainly for features     changes mainly for features;
> >                                                 create new ABI for features
> 
> > And by all means, I intend to cooperate with Gene to see how to
> > make the other part of DMTCP, namely the userspace "glue", work on
> > top of linux-cr to have the benefits of all worlds !
> 
> This is true, and we strongly welcome the cooperation.  We don't know how
> this experiment will turn out, but the only way to find out is to sincerely
> try it.  Whether we succeed or fail, we will learn something either way!
> 
> - Gene and Kapil
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-23 17:53                                               ` Oren Laadan
@ 2010-11-24  3:50                                                 ` Kapil Arya
  2010-11-25 16:04                                                   ` Oren Laadan
  0 siblings, 1 reply; 49+ messages in thread
From: Kapil Arya @ 2010-11-24  3:50 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Gene Cooperman, Tejun Heo, linux-kernel, xemul, Eric W. Biederman,
	Linux Containers

(Our first comment below actually replies to an earlier post by Oren. It seemed
simpler to combine our comments.)

> > d.  screen and other full-screen text programs These are not the only
> > examples of difficult interactions with the rest of the world.
>
> This actually never required a userspace "component" with Zap or linux-cr (to
> the best of my knowledge).

We would guess that Zap would not be able to support screen without a user
space component. The bug occurs when screen is configured to have a status line
at the bottom. We would be interested if you want to try it and let us know the
results.

=============================================

> > > category        linux-cr                        userspace
> > > --------------------------------------------------------------------------------
> > > PERFORMANCE     has _zero_ runtime overhead     visible overhead due to syscalls
> > >                                                 interposition and state tracking
> > >                                                 even w/o checkpoints;
> >
> > In our experiments so far, the overhead of system calls has been
> > unmeasurable.  We never wrap read() or write(), in order to keep overhead
> > low.  We also never wrap pthread synchronization primitives such as locks,
> > for the same reason.  The other system calls are used much less often, and
> > so the overhead has been too small to measure in our experiments.
>
> Syscall interception will have a visible effect on applications that use
> those syscalls. You may not observe overhead with HPC ones, but do you have
> numbers on server apps ?  apps that use fork/clone and pipes extensively ?
> thread benchmarks, etc ?  compare that to the absolute zero overhead of
> linux-cr.

It's true that we haven't taken serious data on overhead with server apps. Is
there a particular server app that you are thinking of as an example? I would
expect fork/clone and pipes to be invoked infrequently in server apps, and so
not to add measurably to CPU time. In most server apps such as MySQL, it is
common to maintain a pool of threads for reuse rather than to repeatedly call
clone for a new thread. This is done to ensure that the overhead of the clone
calls is not significant. I would expect a similar policy for fork and pipes.

<snip>

> > > OPERATION       applications run unmodified     to do c/r, needs 'controller'
> > >                                                 task (launch and manage _entire_
> > >                                                 execution) - point of failure.
> > >                                                 restricts how a system is used.
> >
> > We'd like to clarify what may be some misconceptions.  The DMTCP controller
> > does not launch or manage any tasks.  The DMTCP controller is stateless,
> > and is only there to provide a barrier, namespace server, and single point
> > of contact to relay ckpt/restart commands.  Recall that the DMTCP
> > controller handles processes across hosts --- not just on a single host.
>
> The controller is another point of failure. I already pointed out that the
> (controlled) application crashes when your controller dies, and you mentioned
> it's a bug that should be fixed. But then there will always be a risk of
> another, and another ...   You also mentioned that if the controller dies,
> then the app should continue to run, but will not be checkpointable anymore
> (IIUC).
>
> The point is that the controller is another point of failure, and makes the
> execution/checkpoint intrusive. It also adds security and user-management
> issues, as you'll need one (or more ?) controller per user (right now, it's
> one for all, no ?), and so on.

Just to clarify, DMTCP uses one coordinator for each checkpointable
computation. A single user may be running multiple computations with one
coordinator for each computation. We don't actually use the word controller
in DMTCP terminology because the coordinator is stateless and so is
coordinating but not controlling other processes.

> Plus, because the restarted apps get their virtualized IDs from the
> controller, they now can't "see" existing/new processes that may get the
> "same" pids (virtualization is not in the kernel).

This appears to be a misconception. The wrappers within the user process
maintain the pid-translation table for that process. The translation table is
the translation between the original pid given by the kernel and the current
pid set by the kernel on restart. This is handled locally and does not involve
the coordinator.

In the case of a fork there could be a pid-clash (the original pid generated
for a new process conflicts with someone else's original pid). However, DMTCP
handles this by checking within the fork wrapper for a pid-clash. In the rare
case of a pid-clash, the child process exits and the parent forks again. The
same applies for clone and any pid clash at restart time.
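
The translation table and the fork-time clash check described above can be
sketched roughly as follows. This is a simulation under stated assumptions,
not DMTCP code: `PidTable`, `fork_avoiding_clash`, `raw_fork` and
`discard_child` are hypothetical names, the table is assumed to already know
the original pids it must avoid, and the kernel is replaced by a scripted
sequence of pids so that the example is deterministic.

```python
class PidTable:
    """Maps original (checkpoint-time) pids to current (post-restart) pids."""

    def __init__(self):
        self._orig_to_cur = {}

    def add(self, orig, cur):
        self._orig_to_cur[orig] = cur

    def to_current(self, orig):
        # Pids the table has never seen are passed through unchanged.
        return self._orig_to_cur.get(orig, orig)

    def is_original(self, pid):
        return pid in self._orig_to_cur


def fork_avoiding_clash(table, raw_fork, discard_child, max_tries=16):
    """Fork until the kernel pid does not collide with a known original pid.

    raw_fork() stands in for the real fork and returns the child's pid;
    discard_child(pid) stands in for the clashing child exiting.
    """
    for _ in range(max_tries):
        pid = raw_fork()
        if not table.is_original(pid):
            # A fresh process: its original pid equals its current pid.
            table.add(pid, pid)
            return pid
        discard_child(pid)
    raise RuntimeError("no clash-free pid after %d tries" % max_tries)


# Simulated run: original pid 100 now lives at kernel pid 250, and the
# kernel happens to hand out 100 again for the next fork.
table = PidTable()
table.add(100, 250)
kernel_pids = iter([100, 101])
discarded = []
child = fork_avoiding_clash(table, lambda: next(kernel_pids), discarded.append)
print(child, discarded)  # prints: 101 [100]
```

In this sketch the first child is discarded because its pid would be
indistinguishable from the restarted process that still believes it is
pid 100; the retry yields a pid that is safe to expose unchanged.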

> >     Also, in any computation involving multiple processes, _every_ process
> >     of the computation is a point of failure.  If any process of the
> >     computation dies, then the simple application strategy is to give up
> >     and revert to an earlier checkpoint.  There are techniques by which an
> >     app or DMTCP can recreate certain failed processes.  DMTCP doesn't
> >     currently recreate a dead controller (no demand for it), but it's not
> >     hard to do technically.
>
> The point is that you _add_ a point of failure: you make the "checkpoint"
> operation a possible reason for the application to crash. In contrast, in
> linux-cr the checkpoint is idempotent - harmless because it does not make
> the applications execute. Instead, it merely observes their state.

We were speaking above of the case when the process dies during a
computation. We were not referring to checkpoint time.

<snip>

We would like to add our own comment/question. To set the context we quote an
earlier post:

OL> Even if it did - the question is not how to deal with "glue"
OL> (you demonstrated quite well how to do that with DMTCP), but
OL> how should the basic, core c/r functionality work - which is
OL> below, and orthogonal to the "glue".

There seems to be an implicit assumption that it is easy to separate the DMTCP
"glue code" from the DMTCP C/R engine as separate modules. DMTCP is modular but
it splits the problems into modules along a different line than Linux C/R. We
look forward to the joint experiment in which we would try to combine DMTCP
with Linux C/R. This will help answer the question in our mind.

In order to explore the issue, let's imagine that we have a successful merge of
DMTCP and Linux C/R. The following are some user-space glue issues. It's not
obvious to us how the merged software will handle these issues.

1. Sockets -- DMTCP handles all sockets in a common manner through a single
module. Sockets are checkpointed independently of whether they are local or
remote. In a merger of DMTCP and Linux C/R, what does Linux C/R do when it sees
remote sockets? Or should DMTCP take down all remote sockets before
checkpointing? If DMTCP has to do this, it would be less efficient than the
current design, which keeps the remote socket connections alive during
checkpoint.

2. XLib and X11-server -- Consider checkpointing a single X11 app without the
X11-server and without VNC. This is something we intend to add to DMTCP in the
next few months. We have already mapped out the design in our minds. An X11
application includes the Xlib library. The data of an X11 window is, by
default, contained in the X11 library -- not in the X11-server. The application
communicates with the X11-server using socket connections, which would be
considered a leak by Linux C/R. At restart time, DMTCP will ask the
X11-server to create a bare window and then make the appropriate Xlib call to
repaint the window based on the data stored in the Xlib library.
For checkpoint/resume, the window stays up and does not have to be repainted.
How will the combined DMTCP/Linux C/R work? Will DMTCP have to take
down the window prior to Linux C/R and paint a new window at resume time?
Doesn't this add inefficiency?

3. Checkpointing a single process (e.g. a bash shell) talking to an xterm via
a pty -- We assume that from the viewpoint of Linux C/R a pty is a leak since
there is a second process operating the master end of the pty. In this case we
are guessing that Linux C/R would checkpoint and restart without the guarantees
of reliability. We are guessing that Linux C/R would not save and restore the
pty; instead it would be the responsibility of DMTCP to restore the current
settings of the pty (e.g. packet mode vs. regular mode). Is our understanding
correct? Would this work?

Thanks,
Gene and Kapil

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-24  3:50                                                 ` Kapil Arya
@ 2010-11-25 16:04                                                   ` Oren Laadan
  2010-11-29  4:09                                                     ` Gene Cooperman
  0 siblings, 1 reply; 49+ messages in thread
From: Oren Laadan @ 2010-11-25 16:04 UTC (permalink / raw)
  To: Kapil Arya
  Cc: Gene Cooperman, Tejun Heo, linux-kernel, xemul, Eric W. Biederman,
	Linux Containers

On Tue, 23 Nov 2010, Kapil Arya wrote:

> OL> Even if it did - the question is not how to deal with "glue"
> OL> (you demonstrated quite well how to do that with DMTCP), but
> OL> how should the basic, core c/r functionality work - which is
> OL> below, and orthogonal to the "glue".
> 
> There seems to be an implicit assumption that it is easy to separate the DMTCP
> "glue code" from the DMTCP C/R engine as separate modules. DMTCP is modular but
> it splits the problems into modules along a different line than Linux C/R. We
> look forward to the joint experiment in which we would try to combine DMTCP
> with Linux C/R. This will help answer the question in our mind.

I apologize for being blunt - but this is probably an issue specific to 
DMTCP's engineering...

> In order to explore the issue, let's imagine that we have a successful merge of
> DMTCP and Linux C/R. The following are some user-space glue issues. It's not
> obvious to us how the merged software will handle these issues.
> 
> 1. Sockets -- DMTCP handles all sockets in a common manner through a single
> module. Sockets are checkpointed independently of whether they are local or
> remote. In a merger of DMTCP and Linux C/R, what does Linux C/R do when it sees
> remote sockets? Or should DMTCP take down all remote sockets before
> checkpointing? If DMTCP has to do this, it would be less efficient than the
> current design, which keeps the remote socket connections alive during
> checkpoint.

What is a "local" socket ?  af_unix, or locally connected af_inet ?

Anyway, with linux-cr you'd do what's needed after the restarted tasks are
created, but before their state is restored. For each such "old" socket
that you want to replace, you'd create (in userspace, with arbitrary "glue"
code!) a new socket, and use this socket when restoring the state of the
task. Similarly, you could replace any other resource, not only sockets.
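
The ordering described above (tasks created, then userspace "glue" builds
replacements, then state is restored using them) could look roughly like the
sketch below. Everything in it is an assumption made for illustration: the
record layout, `make_replacements` and `restore_fd_table` are hypothetical
and do not reflect the actual linux-cr interface.

```python
import socket


def make_replacements(saved, factory):
    """Glue step: build a fresh socket for every saved socket marked external."""
    return {rec["id"]: factory(rec) for rec in saved if rec["external"]}


def restore_fd_table(saved, replacements, restore_default):
    """Restore step: prefer a replacement, else restore the saved object."""
    table = {}
    for rec in saved:
        if rec["id"] in replacements:
            table[rec["fd"]] = replacements[rec["id"]]
        else:
            table[rec["fd"]] = restore_default(rec)
    return table


# Hypothetical saved state: fd 3 pointed outside the checkpointed set,
# fd 4 did not and is restored the ordinary way.
saved = [
    {"fd": 3, "id": "sock-a", "external": True},
    {"fd": 4, "id": "sock-b", "external": False},
]
replacements = make_replacements(
    saved, lambda rec: socket.socket(socket.AF_INET, socket.SOCK_STREAM)
)
fd_table = restore_fd_table(saved, replacements, lambda rec: "restored:" + rec["id"])
is_new_socket = isinstance(fd_table[3], socket.socket)
fd_table[3].close()
```

The same substitution table could carry any other kind of resource, which
matches the remark that the approach is not specific to sockets.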

> 
> 2. XLib and X11-server -- Consider checkpointing a single X11 app without the
> X11-server and without VNC. This is something we intend to add to DMTCP in the
> next few months. We have already mapped out the design in our minds. An X11
> application includes the Xlib library. The data of an X11 window is, by
> default, contained in the X11 library -- not in the X11-server. The application
> communicates with the X11-server using socket connections, which would be
> considered a leak by Linux C/R. At restart time, DMTCP will ask the
> X11-server to create a bare window and then make the appropriate Xlib call to
> repaint the window based on the data stored in the Xlib library.
> For checkpoint/resume, the window stays up and does not have to be repainted.
> How will the combined DMTCP/Linux C/R work? Will DMTCP have to take
> down the window prior to Linux C/R and paint a new window at resume time?
> Doesn't this add inefficiency?

Repainting during restart is the least of your problems.

Leak detection is not a problem:
If the socket connects out of the container (like af_inet) - then it is
not a leak, and you treat it as described above.
If the socket connects within the container but you don't checkpoint the
"peer" process - then it is not a container-c/r (in which case you don't
look for leaks).

Also, the application could mark resources to not be checkpointed (e.g. 
scratch memory to save storage, or sockets to not count as leaks).

I don't see any problem with X11 or any other library and "glue".

> 
> 3. Checkpointing a single process (e.g. a bash shell) talking to an xterm via
> a pty -- We assume that from the viewpoint of Linux C/R a pty is a leak since
> there is a second process operating the master end of the pty. In this case we
> are guessing that Linux C/R would checkpoint and restart without the guarantees
> of reliability. We are guessing that Linux C/R would not save and restore the
> pty; instead it would be the responsibility of DMTCP to restore the current
> settings of the pty (e.g. packet mode vs. regular mode). Is our understanding
> correct? Would this work?

I explain again - in case it wasn't clear from my 3-part post: leak 
detection is relevant _only_ for full container-c/r. It doesn't make 
sense otherwise.

If you want to checkpoint individual components of an application,
then it's up to userspace to produce/provide the relevant "glue" to
make it "make sense" when those components restart without their 
original eco-system.

Thanks,

Oren.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
  2010-11-25 16:04                                                   ` Oren Laadan
@ 2010-11-29  4:09                                                     ` Gene Cooperman
  0 siblings, 0 replies; 49+ messages in thread
From: Gene Cooperman @ 2010-11-29  4:09 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Kapil Arya, Gene Cooperman, Tejun Heo, linux-kernel, xemul,
	Eric W. Biederman, Linux Containers

Hi Oren,

On Thu, Nov 25, 2010 at 11:04:16AM -0500, Oren Laadan wrote:
> On Tue, 23 Nov 2010, Kapil Arya wrote:
> 
> > OL> Even if it did - the question is not how to deal with "glue"
> > OL> (you demonstrated quite well how to do that with DMTCP), but
> > OL> how should the basic, core c/r functionality work - which is
> > OL> below, and orthogonal to the "glue".
> > 
> > There seems to be an implicit assumption that it is easy to separate the DMTCP
> > "glue code" from the DMTCP C/R engine as separate modules. DMTCP is modular but
> > it splits the problems into modules along a different line than Linux C/R. We
> > look forward to the joint experiment in which we would try to combine DMTCP
> > with Linux C/R. This will help answer the question in our mind.
> 
> I apologize for being blunt - but this is probably an issue specific to 
> DMTCP's engineering...
> 

I completely agree with you, Oren.  DMTCP was never designed to be split
into a userland and in-kernel replacement.  We will want to re-factor
DMTCP to make this happen.
    I'm sorry if my e-mail came off as confrontational.  That was not my
intention.  I was just looking forward to an interesting intellectual
experiment --- how to go about combining DMTCP and Linux C/R.   I was
trying to guess ahead of time where there are interesting challenges, and
my hope is that we will find a way to solve them together.

Best wishes,
- Gene

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2010-11-29  4:09 UTC | newest]

Thread overview: 49+ messages
     [not found] <Pine.LNX.4.64.1011021530470.12128@takamine.ncl.cs.columbia.edu>
     [not found] ` <4CD08419.5050803@kernel.org>
     [not found]   ` <AANLkTinOg6n3ZA+0gHzw9LouRuUmJ7DJwHtABRy5c=gM@mail.gmail.com>
     [not found]     ` <4CD26948.7050009@kernel.org>
     [not found]       ` <20101104164401.GC10656@sundance.ccs.neu.edu>
     [not found]         ` <4CD3CE29.2010105@kernel.org>
     [not found]           ` <20101106053204.GB12449@count0.beaverton.ibm.com>
     [not found]             ` <20101106204008.GA31077@sundance.ccs.neu.edu>
2010-11-07 21:44               ` [Ksummit-2010-discuss] checkpoint-restart: naked patch Oren Laadan
2010-11-07 23:31                 ` Gene Cooperman
     [not found]               ` <4CD5D99A.8000402@cs.columbia.edu>
     [not found]                 ` <20101107184927.GF31077@sundance.ccs.neu.edu>
     [not found]                   ` <20101107184927.GF31077-Rl5vdzG4YPwx/1z6v04GWfZ8FUJU4vz8@public.gmane.org>
2010-11-07 21:59                     ` Oren Laadan
2010-11-17 11:57                       ` Tejun Heo
2010-11-17 15:39                         ` Serge E. Hallyn
2010-11-17 15:46                           ` Tejun Heo
2010-11-18  9:13                             ` Pavel Emelyanov
     [not found]                               ` <4CE4EE21.6050305-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-11-18  9:48                                 ` Tejun Heo
2010-11-18 20:13                                   ` Jose R. Santos
2010-11-19  3:54                                   ` Serge Hallyn
2010-11-18 19:53                             ` Oren Laadan
2010-11-19  4:10                             ` Serge Hallyn
2010-11-19 14:04                               ` Tejun Heo
2010-11-20 18:05                                 ` Oren Laadan
     [not found]                                 ` <4CE683E1.6010500-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2010-11-19 14:36                                   ` Kirill Korotaev
     [not found]                                     ` <04F4899E-B5C7-4BAF-8F2F-05D507A91408-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2010-11-19 15:33                                       ` Tejun Heo
2010-11-19 16:00                                         ` Alexey Dobriyan
2010-11-19 16:01                                           ` Alexey Dobriyan
2010-11-19 16:10                                             ` Tejun Heo
2010-11-19 16:25                                               ` Alexey Dobriyan
2010-11-19 16:06                                           ` Tejun Heo
2010-11-19 16:16                                             ` Alexey Dobriyan
2010-11-19 16:19                                               ` Tejun Heo
2010-11-19 16:27                                                 ` Alexey Dobriyan
     [not found]                                                   ` <AANLkTin7kd3crS+fTLLea5PhAii7B3dz=n7p7YtQ6d4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-11-19 16:32                                                     ` Tejun Heo
2010-11-19 16:38                                                       ` Alexey Dobriyan
2010-11-19 16:50                                                         ` Tejun Heo
2010-11-19 16:55                                                           ` Alexey Dobriyan
2010-11-20 17:58                                         ` Oren Laadan
2010-11-20 18:08                                   ` Oren Laadan
2010-11-20 18:11                                   ` Oren Laadan
     [not found]                                     ` <4CE69B8C.6050606-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-11-20 18:15                                       ` Oren Laadan
2010-11-20 19:33                                         ` Tejun Heo
2010-11-21  8:18                                           ` Gene Cooperman
2010-11-21  8:21                                             ` Gene Cooperman
2010-11-22 18:02                                               ` Sukadev Bhattiprolu
2010-11-23 17:53                                               ` Oren Laadan
2010-11-24  3:50                                                 ` Kapil Arya
2010-11-25 16:04                                                   ` Oren Laadan
2010-11-29  4:09                                                     ` Gene Cooperman
2010-11-21 22:41                                             ` Grant Likely
2010-11-22 17:34                                             ` Oren Laadan
2010-11-22 17:18                                           ` Oren Laadan
2010-11-17 22:17                         ` Matt Helsley
     [not found]                           ` <20101117221713.GA27736-52DBMbEzqgQ/wnmkkaCWp/UQ3DHhIser@public.gmane.org>
2010-11-18 10:06                             ` Tejun Heo
2010-11-18 20:25                             ` Oren Laadan
     [not found]           ` <AANLkTimDXKsBCxbsEOfgkYV2R8FK=bhFdmx9UQow5hqp@mail.gmail.com>
     [not found]             ` <4CD5DCE3.3000109@cs.columbia.edu>
     [not found]               ` <20101107194222.GG31077@sundance.ccs.neu.edu>
     [not found]                 ` <4CD71A6B.3020905@cs.columbia.edu>
     [not found]                   ` <20101107230516.GJ31077@sundance.ccs.neu.edu>
     [not found]                     ` <4CD774CA.8030004@cs.columbia.edu>
     [not found]                       ` <20101108162630.GN31077@sundance.ccs.neu.edu>
2010-11-08 18:14                         ` Oren Laadan
2010-11-08 18:37                           ` Gene Cooperman
2010-11-08 19:34                             ` Oren Laadan
