From: Oren Laadan <orenl@cs.columbia.edu>
To: Gene Cooperman <gene@ccs.neu.edu>
Cc: Kapil Arya <kapil@ccs.neu.edu>, Tejun Heo <tj@kernel.org>,
ksummit-2010-discuss@lists.linux-foundation.org,
linux-kernel@vger.kernel.org, hch@lst.de
Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
Date: Sun, 07 Nov 2010 16:30:19 -0500
Message-ID: <4CD71A6B.3020905@cs.columbia.edu>
In-Reply-To: <20101107194222.GG31077@sundance.ccs.neu.edu>
On 11/07/2010 02:42 PM, Gene Cooperman wrote:
> I'd like to add a few clarifications, below, about DMTCP concerning
> Oren's comments. I'd also like to point out that we've had about 100
> downloads per month from sourceforge (and some interesting use cases
> from end users) over the last year (although the sourceforge numbers
> do go up and down :-) ). In general, I think we'll all understand the
> situation better after having had the opportunity to talk offline.
> Below are some clarifications about DMTCP.
> ===
>
>> For example, in your example, you'd need to wrap the library calls
>> (e.g. of MPI implementation) and replace them to use TCP/IP or
>> infiniband. Wrapping on system calls won't help you.
>
> We do not put any wrappers around MPI library calls. MPI calls things
> like open, close, connect, listen, execve({"ssh", ...}, ...), etc.
> At this time, DMTCP adds wrappers _only_ around calls to libc.so
> and libpthread.so . This is sufficient to checkpoint a distributed
> computation like MPI.
Of course. And you don't need syscall virtualization for this.
Zap did it already many years ago :) Only problem with the above
is that, conveniently enough, you _left out_ the context:
>> For example, if a distributed computation runs over infiniband, can
>> we migrate to a TCP/IP cluster. For this, one needs the flexibility
>> of wrappers around system calls.
Do you also support checkpointing a distributed app that uses an
infiniband MPI stack and restarting it with a TCP based MPI stack ?
Can you do it with only syscall wrapping and without knowledge
of the MPI implementation and some MPI-specific logic in the
wrappers ? I'm curious how you do that without wrapping around
MPI calls, or without a c/r-aware implementation of MPI.
Again, this is unrelated to how you do the core c/r work. I think
we both agree that _this_ kind of app-wrappers/app-awareness is
useful for certain uses of c/r.
[snip]
>> So I'll repeat the question I asked there: is re-implementing
>> chunks of kernel functionality and all namespaces in userspace
>> the way to go ?
>
> If you're referring to interposition here, that takes place essentially
> in the wrappers, and the wrappers are only 3000 lines of code in DMTCP.
> Also, I don't believe that we're "re-implementing chunks of kernel
> functionality", but let's continue that discussion offline.
The interposition itself is relatively simple (though not atomic).
The problem is the logic to "spy" on and "lie" to the applications.
Examples: saving ptrace state, saving FD_CLOEXEC flag, correctly
maintaining a userspace pid-ns, etc.
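
To make the FD_CLOEXEC example concrete, here is a minimal sketch of the
per-descriptor bookkeeping a userspace checkpointer would need, assuming
POSIX fcntl(2). The helper names fd_get_cloexec() and fd_set_cloexec() are
hypothetical and are not taken from DMTCP or from the kernel patch set.

```c
#include <fcntl.h>

/* Hypothetical helper: report the close-on-exec state of fd.
 * Returns 1 if FD_CLOEXEC is set, 0 if it is clear, -1 on error. */
int fd_get_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Hypothetical helper: set (on != 0) or clear (on == 0) FD_CLOEXEC
 * on fd, as a restart path would when re-creating the descriptor.
 * Returns 0 on success, -1 on error. */
int fd_set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    flags = on ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
    return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}
```

A checkpoint pass would call fd_get_cloexec() for every open descriptor and
store the result; restart would replay it with fd_set_cloexec(). This is
state the kernel already tracks per descriptor.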
[...]
>
>> ... (yes, transparent means that
>> it does not require LD_PRELOAD or collaboration of the application!
>> nor does it require userspace virtualizations of so many things
>> already provided by the kernel today), more generic, more flexible,
>> provides more guarantees, covers more types or states of resources,
>> and can perform significantly better.
>
> I still haven't understood why you object to the DMTCP use of LD_PRELOAD.
> How will the user app ever know that we used LD_PRELOAD, since we remove
> LD_PRELOAD from the environment before the user app libraries and main
> can begin? And, if you really object to LD_PRELOAD, then there are
> other ways to capture control. Similarly, I'll have to understand better
I don't object to it per se - it's actually pretty useful oftentimes.
But in our context, it has limitations. For example, it does not
cover statically linked applications, nor apps that invoke syscalls
directly using int 0x80. Also, it conflicts with LD_PRELOAD possibly
needed for other software (like valgrind) - for which again you would
need yet another per-app wrapper, at the very least.
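
The direct-syscall limitation can be sketched as follows, assuming Linux.
The function name raw_getpid() is hypothetical. On x86-64 the sketch uses
the `syscall` instruction as the 64-bit counterpart of int 0x80; either way
no libc symbol is involved, so there is nothing for an LD_PRELOAD library
to interpose on.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical example: obtain the pid without calling libc's getpid(). */
long raw_getpid(void)
{
#if defined(__x86_64__)
    long ret;
    /* Trap into the kernel directly: syscall number in rax, result in
     * rax; the instruction clobbers rcx and r11. No dynamic symbol is
     * resolved, so a preloaded wrapper never sees this call. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"((long)SYS_getpid)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback for other architectures. syscall(2) is itself a libc
     * function, so this path does NOT demonstrate the bypass. */
    return syscall(SYS_getpid);
#endif
}
```

A wrapper library that overrides getpid() to return a virtual pid would be
consulted by a plain getpid() call, while raw_getpid() returns whatever the
kernel reports.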
> what you mean by the _collaboration of the application_. DMTCP operates
> on unmodified application binaries.
I mean that the application needs to be scheduled and to run in order
to participate in its own checkpoint. You use syscall interposition
and signal games to do exactly that - gain control over the app
and run your library's code. This has at least three negatives:
first, some apps don't want to or can't run - e.g. ptraced, or
swapped (think incremental checkpoint: why swap everything in ?!);
second, the coordination can take significant time, especially if
many tasks/threads and resources are involved; third, it modifies
the state of the app - if something goes wrong while you use c/r
to migrate an app, you impact the app.
(Using 'ptrace' relieves you of the need for "collaboration"
of processes, but doesn't address the other problems and adds
its own issues.)
> Basically, if _transparent_ means
> that one is not allowed to use anything at all from userland, then I
> agree with you that no userland checkpointing can ever be transparent.
> But, I think that's a biased definition of _transparent_. :-)
"Transparent" c/r means "invisible" to the user/apps, i.e. that
you don't restrict the user or the app in what they do and how
they do it.
Did you ever try to 'ltrace skype' ? There exists useful and
popular software that doesn't like being spied on...
Oren.