From: Dave Hansen <dave@linux.vnet.ibm.com>
To: Ingo Molnar <mingo@elte.hu>
Cc: Andrew Morton <akpm@linux-foundation.org>,
orenl@cs.columbia.edu, linux-api@vger.kernel.org,
containers@lists.linux-foundation.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
torvalds@linux-foundation.org, viro@zeniv.linux.org.uk,
hpa@zytor.com, Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
Date: Thu, 12 Feb 2009 10:11:22 -0800
Message-ID: <1234462282.30155.171.camel@nimitz>
In-Reply-To: <20090211141434.dfa1d079.akpm@linux-foundation.org>
On Wed, 2009-02-11 at 14:14 -0800, Andrew Morton wrote:
> On Tue, 10 Feb 2009 09:05:47 -0800
> Dave Hansen <dave@linux.vnet.ibm.com> wrote:
>
> > On Tue, 2009-01-27 at 12:07 -0500, Oren Laadan wrote:
> > > Checkpoint-restart (c/r): a couple of fixes in preparation for 64bit
> > > architectures, and a couple of fixes for bugs (comments from Serge
> > > Hallyn, Sukadev Bhattiprolu and Nathan Lynch). Updated and tested
> > > against v2.6.28.
> > >
> > > Aiming for -mm.
> >
> > Is there anything that we're waiting on before these can go into -mm? I
> > think the discussion on the first few patches has died down to almost
> > nothing. They're pretty reviewed-out. Do they need a run in -mm? I
> > don't think linux-next is quite appropriate since they're not _quite_
> > aimed at mainline yet.
> >
>
> I raised an issue a few months ago and got inconclusively waffled at.
> Let us revisit.
>
> I am concerned that this implementation is a bit of a toy, and that we
> don't know what a sufficiently complete implementation will look like.
> There is a risk that if we merge the toy we either:
>
> a) end up having to merge unacceptably-expensive-to-maintain code to
> make it a non-toy or
>
> b) decide not to merge the unacceptably-expensive-to-maintain code,
> leaving us with a toy or
>
> c) simply cannot work out how to implement the missing functionality.
>
>
> So perhaps we can proceed by getting you guys to fill out the following
> paperwork:
>
> - In bullet-point form, what features are present?
* i386 arch is supported
* a process can perform a "self-checkpoint", which means calling
  sys_checkpoint() on itself, as well as an "external checkpoint", where
  one task checkpoints another
* supported fds:
  * "normal" files on the filesystem
  * both endpoints of a pipe are checkpointed, as are pipe contents
* each process's memory map is saved
* the contents of anonymous memory are saved
* infrastructure for managing objects in the checkpoint which are
  "shared" by multiple users, such as fds or a SysV semaphore
* multiple processes may be checkpointed during a single sys_checkpoint()
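The shared-object bullet above can be illustrated with a small userspace sketch in Python. It is not the code from the patch set: the `SharedObjectTable` class, the record tuples, and the dict-based "file" objects are all hypothetical. The only point is that an object referenced by several users is written once and referred to by a tag afterwards.

```python
class SharedObjectTable:
    """Assigns each distinct object a small integer tag (hypothetical scheme)."""

    def __init__(self):
        self._tags = {}     # id(obj) -> tag
        self._objects = []  # holds references so id() values stay unique

    def lookup_or_add(self, obj):
        """Return (tag, is_new); is_new is True only the first time obj is seen."""
        key = id(obj)
        if key in self._tags:
            return self._tags[key], False
        tag = len(self._objects) + 1
        self._tags[key] = tag
        self._objects.append(obj)
        return tag, True


def checkpoint_fds(table, fd_to_file):
    """Emit full state on first sight of a file object, a tag reference after."""
    records = []
    for fd in sorted(fd_to_file):
        file_obj = fd_to_file[fd]
        tag, is_new = table.lookup_or_add(file_obj)
        if is_new:
            records.append(("file", tag, file_obj["path"]))
        records.append(("fd", fd, tag))
    return records


# fds 3 and 4 share one open file (as after dup()); fd 5 is separate.
shared = {"path": "/tmp/log"}
records = checkpoint_fds(
    SharedObjectTable(), {3: shared, 4: shared, 5: {"path": "/etc/hosts"}}
)
```

On restart, the same table would be used in reverse: a "file" record creates the object, and every later "fd" record with that tag attaches to it instead of creating a second copy.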
> - In bullet-point form, what features are missing, and should be added?
* support for more architectures than i386
* file descriptors:
  * sockets (network, AF_UNIX, etc...)
  * device files
  * shmfs, hugetlbfs
  * epoll
  * unlinked files
* Filesystem state
  * contents of files
  * mount tree for individual processes
* flock
* threads and sessions
* CPU and NUMA affinity
* sys_remap_file_pages()

This is a very minimal list that is surely incomplete and sure to grow.

I think of it like kernel scalability. Is scalability important? Do we
want the whole kernel to scale? Yes, and yes, of course. *Does* every
single device and feature in the kernel scale? No way. Will it ever be
"done"? No freakin' way! But, the kernel is scalable on the workloads
that are important to people.

Checkpoint/restart is the same way. We intend to make core kernel
functionality checkpointable first. We'll move outwards from there as
we (and our users) deem things important, but we'll certainly never be
done.
> - Is it possible to briefly sketch out the design of the to-be-added
> features?
For each architecture (and indeed each processor variation) we need to
look at how and when its registers are saved on kernel entry, as well as
things like 32/64-bit processes and mm_context considerations. There is
x86_64, s390 and ppc work ongoing. Those ports have required quite small
changes in the generic code, which is a good sign.

Each fd type will need to be worked on separately. Device files will
generally have to be one-off. /dev/null has no internal state at all.
But, work needs to be done for devices which may have had all kinds of
ioctl()s done on them.

Unlinked files will need their contents stored in the checkpoint so that
they may be copied over during restart (say to a temporary file),
opened, and unlinked again. Files on kernel-internal mounts will need
similar treatment (think 'pipe_mnt').
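The unlinked-file restart sequence (copy the saved contents to a temporary file, open it, unlink it again) can be sketched from userspace in Python. This is only an illustration on a POSIX system, not the in-kernel restart path; the function name and its `offset` parameter are assumptions made for the example.

```python
import os
import tempfile


def restore_unlinked_file(saved_contents, offset=0, directory=None):
    """Recreate an open-but-unlinked file from contents saved in a checkpoint.

    Returns (fd, path). The path no longer exists when this returns, but the
    descriptor stays usable, which matches the state at checkpoint time.
    """
    fd, path = tempfile.mkstemp(dir=directory)  # temporary file, already open
    try:
        view = memoryview(saved_contents)
        while view:                             # handle short writes
            written = os.write(fd, view)
            view = view[written:]
        os.lseek(fd, offset, os.SEEK_SET)       # restore the saved position
    except Exception:
        os.close(fd)
        raise
    finally:
        os.unlink(path)                         # unlink again; fd stays valid
    return fd, path
```

A caller would pass the bytes and file position recorded in the checkpoint image and then install the returned descriptor at the original fd number, for example with os.dup2().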

We expect the filesystem *contents* to be taken care of generally by
outside mechanisms like dm or btrfs snapshotting.

For the filesystem namespace, we'll effectively need to export what we
already have in /proc/$pid/mountinfo.

I'm going to punt on explaining the networking bits for now because I
think I'd be wasting your time. There are a couple of other guys around
much more versed in that area.
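To make the /proc/$pid/mountinfo point above concrete, here is a minimal userspace sketch in Python that splits one line of that file into the fields a checkpoint would have to record for each mount. The field order follows the description in proc(5); the `MountEntry` type and the `parse_mountinfo_line` name are made up for illustration and are not part of the patch set.

```python
from typing import NamedTuple


class MountEntry(NamedTuple):
    mount_id: int
    parent_id: int
    device: str            # "major:minor" of the backing device
    root: str              # root of the mount within the filesystem
    mount_point: str       # path relative to the process's root
    options: str           # per-mount options
    optional_fields: tuple # zero or more "tag[:value]" items
    fs_type: str
    source: str
    super_options: str     # per-superblock options


def parse_mountinfo_line(line):
    """Parse one line of /proc/<pid>/mountinfo."""
    # A lone "-" separates the variable-length optional fields from the rest.
    left, _, right = line.rstrip("\n").partition(" - ")
    head = left.split(" ")
    fs_type, source, super_options = right.split(" ", 2)
    return MountEntry(
        mount_id=int(head[0]),
        parent_id=int(head[1]),
        device=head[2],
        root=head[3],
        mount_point=head[4],
        options=head[5],
        optional_fields=tuple(head[6:]),
        fs_type=fs_type,
        source=source,
        super_options=super_options,
    )


# Example line taken from the proc(5) manual page.
entry = parse_mountinfo_line(
    "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue"
)
```

The parent_id links are what allow the mount tree to be rebuilt in the right order at restart: a mount can only be recreated after the mount it sits on.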
> For extra marks:
>
> - Will any of this involve non-trivial serialisation of kernel
> objects? If so, that's getting into the
> unacceptably-expensive-to-maintain space, I suspect.
We have some structures that are certainly tied to the kernel-internal
ones. However, we are certainly *not* simply writing kernel structures
to userspace. We could do that with /dev/mem. We are carefully pulling
out the minimal bits of information from the kernel structures that we
*need* to recreate the function of the structure at restart. There is a
maintenance burden here but, so far, that burden is almost entirely in
checkpoint/*.c. We intend to test this functionality thoroughly to
ensure that we don't regress once we have integrated it.
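A small illustration of the difference between dumping a structure and saving only what is needed, written as a Python sketch. The record layout here (fd number, open flags, file position) is an assumption made up for this example and is not the format used in checkpoint/*.c. The image format is declared explicitly, so it does not change when the in-memory object gains or loses fields.

```python
import struct

# Hypothetical on-disk layout: int32 fd, uint32 flags, int64 position,
# little-endian with no padding. It is independent of any in-memory layout.
FD_RECORD = struct.Struct("<iIq")


def save_fd_record(open_file):
    """Pick out only the fields needed to recreate the fd at restart."""
    return FD_RECORD.pack(open_file["fd"], open_file["flags"], open_file["pos"])


def load_fd_record(blob):
    """Decode a record written by save_fd_record()."""
    fd, flags, pos = FD_RECORD.unpack(blob)
    return {"fd": fd, "flags": flags, "pos": pos}


# The in-memory object carries extra state that never reaches the image.
in_memory = {"fd": 3, "flags": 0o2, "pos": 4096, "readahead_state": object()}
blob = save_fd_record(in_memory)
restored = load_fd_record(blob)
```

The maintenance cost is then limited to the pack and unpack functions: a change to the in-memory object only matters if it changes one of the fields named in the record.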
> - Does (or will) this feature also support process migration? If
> not, I'd have thought this to be a showstopper.
You mean moving processes between machines? Yes, it certainly will.
That is one of the primary design goals.
-- Dave