From: Oren Laadan <orenl@cs.columbia.edu>
To: Mike Waychison <mikew@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-api@vger.kernel.org, containers@lists.linux-foundation.org,
hpa@zytor.com, linux-kernel@vger.kernel.org,
Dave Hansen <dave@linux.vnet.ibm.com>,
linux-mm@kvack.org, viro@zeniv.linux.org.uk, mingo@elte.hu,
mpm@selenic.com, tglx@linutronix.de,
Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>,
Alexey Dobriyan <adobriyan@gmail.com>,
xemul@openvz.org
Subject: Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
Date: Fri, 13 Mar 2009 18:35:58 -0400 [thread overview]
Message-ID: <49BADFCE.8020207@cs.columbia.edu> (raw)
In-Reply-To: <49BAC6AF.9090607@google.com>
Mike Waychison wrote:
> Linus Torvalds wrote:
>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>
>>> Ying Han [yinghan@google.com] wrote:
>>> | Hi Serge:
>>> | I made a patch based on Oren's tree recently which implement a new
>>> | syscall clone_with_pid. I tested with checkpoint/restart process tree
>>> | and it works as expected.
>>>
>>> Yes, I think we had a version of clone() with pid a while ago.
>> Are people _at_all_ thinking about security?
>>
>> Obviously not.
>>
>> There's no way we can do anything like this. Sure, it's trivial to do
>> inside the kernel. But it also sounds like a _wonderful_ attack vector
>> against badly written user-land software that sends signals and has small
>> races.
>
> I'm not really sure how this is different than a malicious app going off
> and spawning thousands of threads in an attempt to hit a target pid from
> a security pov. Sure, it makes it easier, but it's not like there is
> anything in place to close the attack vector.
>
>> Quite frankly, from having followed the discussion(s) over the last few
>> weeks about checkpoint/restart in various forms, my reaction to just about
>> _all_ of this is that people pushing this are pretty damn borderline.
>>
>> I think you guys are working on all the wrong problems.
>>
>> Let's face it, we're not going to _ever_ checkpoint any kind of general
>> case process. Just TCP makes that fundamentally impossible in the general
>> case, and there are lots and lots of other cases too (just something as
>> totally _trivial_ as all the files in the filesystem that don't get rolled
>> back).
>
> In some instances such as ours, TCP is probably the easiest thing to
> migrate. In an rpc-based cluster application, TCP is nothing more than
> an RPC channel and applications already have to handle RPC channel
> failure and re-establishment.
>
> I agree that this is not the 'general case' as you mention above
> however. This is the bit that sorta bothers me with the way the
> implementation has been going so far on this list. The implementation
> that folks are building on top of Oren's patchset tries to be everything
> to everybody. For our purposes, we need to have the flexibility of
> choosing *how* we checkpoint. The line seems to be arbitrarily drawn at
> the kernel being responsible for checkpointing and restoring all
> resources associated with a task, and leaving userland with nothing more
> than transporting filesystem bits. This approach isn't flexible enough:
> Consider the case where we want to stub out most of the TCP file
> descriptors with ECONNRESETed sockets because we know that they are RPC
> sockets and can re-establish themselves, but we want to use some other
> mechanism for TCP sockets we don't know much about. The current
> monolithic approach has zero flexibility for doing anything like this,
> and I figure out how we could even fit anything like this in.
The flexibility exists, but wasn't spelled out, so here it is:
1) Similar to madvice(), I envision a cradvice() that could tell the c/r
something about specific resources, e.g.:
* cradvice(CR_ADV_MEM, ptr, len) -> don't save that memory, it's scratch
* cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET) -> reset connection on restart
etc .. (nevermind the exact interface right now)
2) Tasks can ask to be notified (e.g. register a signal) when a checkpoint
or a restart complete successfully. At that time they can do their private
house-keeping if they know better.
3) If restoring some resource is significantly easier in user space (e.g. a
file-descriptor of some special device which user space knows how to
re-initialize), then the restarting task can prepare it ahead of time,
and, call:
* cradvice(CR_ADV_USERFD, fd, 0) -> use the fd in place instead of trying
to restore it yourself.
Method #3 is what I used in Zap to implement distributed checkpoints, where
it is so much easier to recreate all network connections in user space then
putting that logic into the kernel.
Now, on the other hand, doing the c/r from userland is much less flexible
than in the kernel (e.g. epollfd, futex state and much more) and requires
exposing tremendous amount of in-kernel data to user space. And we all know
than exposing internals is always a one-way ticket :(
[...]
Oren.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2009-03-13 22:35 UTC|newest]
Thread overview: 121+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
2009-01-27 17:07 ` [RFC v13][PATCH 01/14] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
2009-01-27 17:20 ` Randy Dunlap
2009-01-27 17:08 ` [RFC v13][PATCH 02/14] Checkpoint/restart: initial documentation Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 03/14] Make file_pos_read/write() public Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 04/14] General infrastructure for checkpoint restart Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 05/14] x86 support for checkpoint/restart Oren Laadan
2009-02-24 7:47 ` Nathan Lynch
[not found] ` <20090224014739.1b82fc35-4v5LP+xe+1byhTdZtsIeww@public.gmane.org>
2009-02-24 16:06 ` Dave Hansen
2009-03-18 7:21 ` Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 06/14] Dump memory address space Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 07/14] Restore " Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 08/14] Infrastructure for shared objects Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 09/14] Dump open file descriptors Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 11/14] External checkpoint of a task other than ourself Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 13/14] Checkpoint multiple processes Oren Laadan
[not found] ` <1233076092-8660-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-01-27 17:08 ` [RFC v13][PATCH 10/14] Restore open file descriprtors Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 12/14] Track in-kernel when we expect checkpoint/restart to work Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 14/14] Restart multiple processes Oren Laadan
2009-02-10 17:05 ` [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Dave Hansen
2009-02-11 22:14 ` Andrew Morton
2009-02-12 9:17 ` Ingo Molnar
[not found] ` <20090212091721.GB1888-X9Un+BFzKDI@public.gmane.org>
2009-02-12 18:11 ` Dave Hansen
2009-02-12 20:48 ` Serge E. Hallyn
2009-02-13 10:20 ` Ingo Molnar
2009-02-12 18:11 ` Dave Hansen
2009-02-12 19:30 ` Matt Mackall
2009-02-12 19:42 ` Andrew Morton
2009-02-12 21:51 ` What can OpenVZ do? Dave Hansen
2009-02-12 22:10 ` Andrew Morton
2009-02-12 23:04 ` How much of a mess does OpenVZ make? ;) Was: " Dave Hansen
2009-02-26 15:57 ` Alexey Dobriyan
2009-03-10 21:53 ` Alexey Dobriyan
2009-03-10 23:28 ` Serge E. Hallyn
2009-03-11 8:26 ` Cedric Le Goater
2009-03-12 14:53 ` Serge E. Hallyn
2009-03-12 21:01 ` Greg Kurz
2009-03-12 21:21 ` Serge E. Hallyn
2009-03-13 4:29 ` Ying Han
2009-03-13 5:34 ` Sukadev Bhattiprolu
[not found] ` <20090313053458.GA28833-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-03-13 6:19 ` Ying Han
2009-03-13 17:27 ` Linus Torvalds
2009-03-13 19:02 ` Serge E. Hallyn
[not found] ` <alpine.LFD.2.00.0903131018390.3940-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2009-03-13 19:35 ` Alexey Dobriyan
2009-03-13 21:01 ` Linus Torvalds
2009-03-13 21:51 ` Dave Hansen
2009-03-13 22:15 ` Oren Laadan
2009-03-14 0:27 ` Eric W. Biederman
2009-03-14 8:12 ` Ingo Molnar
2009-03-16 22:33 ` Kevin Fox
2009-03-19 21:19 ` Eric W. Biederman
[not found] ` <alpine.LFD.2.00.0903131401070.3940-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2009-03-14 0:20 ` Alexey Dobriyan
2009-03-14 8:25 ` Ingo Molnar
[not found] ` <20090314082532.GB16436-X9Un+BFzKDI@public.gmane.org>
2009-03-14 17:11 ` Joseph Ruscio
2009-03-16 6:01 ` Oren Laadan
2009-03-13 20:48 ` Mike Waychison
2009-03-13 22:35 ` Oren Laadan [this message]
2009-03-18 18:54 ` Mike Waychison
2009-03-18 19:04 ` Oren Laadan
[not found] ` <604427e00903122129y37ad791aq5fe7ef2552415da9-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-03-13 15:27 ` Cedric Le Goater
[not found] ` <49BA7B60.60607-GANU6spQydw@public.gmane.org>
2009-03-13 17:11 ` Greg Kurz
2009-03-13 17:37 ` Serge E. Hallyn
2009-03-13 15:47 ` Cedric Le Goater
2009-03-13 16:35 ` Serge E. Hallyn
2009-03-13 16:53 ` Cedric Le Goater
2009-02-26 16:27 ` Alexey Dobriyan
2009-02-26 17:33 ` Ingo Molnar
[not found] ` <20090226173302.GB29439-X9Un+BFzKDI@public.gmane.org>
2009-02-26 18:30 ` Greg Kurz
2009-02-26 22:17 ` Alexey Dobriyan
[not found] ` <20090226221709.GA2924-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
2009-02-27 9:19 ` Greg Kurz
2009-02-27 10:53 ` Alexey Dobriyan
2009-02-27 14:33 ` Cedric Le Goater
2009-02-27 9:36 ` Cedric Le Goater
2009-02-26 22:31 ` Alexey Dobriyan
2009-02-27 9:03 ` Ingo Molnar
2009-02-27 9:19 ` Andrew Morton
2009-02-27 10:57 ` Alexey Dobriyan
[not found] ` <20090227090323.GC16211-X9Un+BFzKDI@public.gmane.org>
2009-02-27 9:22 ` Andrew Morton
2009-02-27 10:59 ` Alexey Dobriyan
2009-02-27 16:14 ` Dave Hansen
2009-02-27 21:57 ` Alexey Dobriyan
[not found] ` <20090227215749.GA3453-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
2009-02-27 21:54 ` Dave Hansen
[not found] ` <20090226223112.GA2939-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
2009-03-01 1:33 ` Alexey Dobriyan
[not found] ` <20090301013304.GA2428-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
2009-03-01 20:02 ` Serge E. Hallyn
[not found] ` <20090301200231.GA25276-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-03-01 20:56 ` Alexey Dobriyan
2009-03-01 22:21 ` Serge E. Hallyn
2009-03-03 16:17 ` Cedric Le Goater
2009-03-03 18:28 ` Serge E. Hallyn
2009-02-13 10:53 ` Ingo Molnar
[not found] ` <20090213105302.GC4608-X9Un+BFzKDI@public.gmane.org>
2009-02-16 20:51 ` Dave Hansen
2009-02-17 22:23 ` Ingo Molnar
[not found] ` <20090217222319.GA10546-X9Un+BFzKDI@public.gmane.org>
2009-02-17 22:30 ` Dave Hansen
2009-02-18 0:32 ` Ingo Molnar
2009-02-18 0:40 ` Dave Hansen
2009-02-18 5:11 ` Alexey Dobriyan
2009-02-18 18:16 ` Ingo Molnar
[not found] ` <20090218181644.GD19995-X9Un+BFzKDI@public.gmane.org>
2009-02-18 21:27 ` Dave Hansen
2009-02-18 23:15 ` Ingo Molnar
2009-02-19 19:06 ` Banning checkpoint (was: Re: What can OpenVZ do?) Alexey Dobriyan
2009-02-19 19:11 ` Dave Hansen
2009-02-24 4:47 ` Alexey Dobriyan
[not found] ` <20090224044752.GB3202-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
2009-02-24 5:11 ` Dave Hansen
2009-02-24 15:43 ` Serge E. Hallyn
2009-02-24 20:09 ` Alexey Dobriyan
2009-02-12 22:17 ` What can OpenVZ do? Alexey Dobriyan
2009-02-13 10:27 ` Ingo Molnar
2009-02-13 11:32 ` Alexey Dobriyan
2009-02-13 11:45 ` Ingo Molnar
2009-02-13 22:28 ` Alexey Dobriyan
2009-03-14 0:04 ` Eric W. Biederman
2009-03-14 0:26 ` Serge E. Hallyn
2009-02-12 22:57 ` [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Dave Hansen
2009-02-12 23:05 ` Matt Mackall
2009-02-12 23:13 ` Dave Hansen
2009-02-13 23:28 ` Andrew Morton
2009-02-14 23:08 ` Ingo Molnar
2009-02-14 23:31 ` Andrew Morton
2009-02-14 23:50 ` Ingo Molnar
[not found] ` <20090213152836.0fbbfa7d.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2009-02-16 17:37 ` Dave Hansen
2009-03-13 2:45 ` Oren Laadan
2009-03-13 3:57 ` Oren Laadan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=49BADFCE.8020207@cs.columbia.edu \
--to=orenl@cs.columbia.edu \
--cc=adobriyan@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=containers@lists.linux-foundation.org \
--cc=dave@linux.vnet.ibm.com \
--cc=hpa@zytor.com \
--cc=linux-api@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mikew@google.com \
--cc=mingo@elte.hu \
--cc=mpm@selenic.com \
--cc=sukadev@linux.vnet.ibm.com \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
--cc=viro@zeniv.linux.org.uk \
--cc=xemul@openvz.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).