From: "Serge E. Hallyn" <serue@us.ibm.com>
To: Oren Laadan <orenl@cs.columbia.edu>
Cc: dave@linux.vnet.ibm.com, containers@lists.linux-foundation.org,
jeremy@goop.org, linux-kernel@vger.kernel.org, arnd@arndb.de
Subject: Re: [RFC v5][PATCH 6/8] Checkpoint/restart: initial documentation
Date: Mon, 15 Sep 2008 15:26:15 -0500
Message-ID: <20080915202615.GA28683@us.ibm.com>
In-Reply-To: <1221347167-9956-7-git-send-email-orenl@cs.columbia.edu>

Quoting Oren Laadan (orenl@cs.columbia.edu):
> Covers application checkpoint/restart, overall design, interfaces
> and checkpoint image format.
>
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>

This really should include your demo programs from your patch 0/9
announcement.

> ---
> Documentation/checkpoint.txt | 207 ++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 207 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/checkpoint.txt
>
> diff --git a/Documentation/checkpoint.txt b/Documentation/checkpoint.txt
> new file mode 100644
> index 0000000..6bf75ce
> --- /dev/null
> +++ b/Documentation/checkpoint.txt
> @@ -0,0 +1,207 @@
> +
> + === Checkpoint-Restart support in the Linux kernel ===
> +
> +Copyright (C) 2008 Oren Laadan
> +
> +Author: Oren Laadan <orenl@cs.columbia.edu>
> +
> +License: The GNU Free Documentation License, Version 1.2
> + (dual licensed under the GPL v2)
> +Reviewers:
> +
> +Application checkpoint/restart [CR] is the ability to save the state
> +of a running application so that it can later resume its execution
> +from the time at which it was checkpointed. An application can be
> +migrated by checkpointing it on one machine and restarting it on
> +another. CR can provide many potential benefits:
> +
> +* Failure recovery: by rolling back to a previous checkpoint.
> +
> +* Improved response time: by restarting applications from checkpoints
> + instead of from scratch.
> +
> +* Improved system utilization: by suspending long running CPU
> + intensive jobs and resuming them when load decreases.
> +
> +* Fault resilience: by migrating applications off of faulty hosts.
> +
> +* Dynamic load balancing: by migrating applications to less loaded
> + hosts.
> +
> +* Improved service availability and administration: by migrating
> + applications before host maintenance so that they continue to run
> + with minimal downtime.
> +
> +* Time-travel: by taking periodic checkpoints and restarting from
> + any previous checkpoint.
> +
> +
> +=== Overall design
> +
> +Checkpoint and restart is done in the kernel as much as possible. The
> +kernel exports a relatively opaque 'blob' of data to userspace which can
> +then be handed to the new kernel at restore time. The 'blob' contains
> +data and state of select portions of kernel structures such as VMAs
> +and mm_structs, as well as copies of the actual memory that the tasks
> +use. Any changes in this blob's format between kernel revisions can be
> +handled by an in-userspace conversion program. The approach is similar
> +to virtually all of the commercial CR products out there, as well as
> +the research project Zap.
> +
> +Two new system calls are introduced to provide CR: sys_checkpoint and
> +sys_restart. The checkpoint code basically serializes internal kernel
> +state and writes it out to a file descriptor, and the resulting image
> +is stream-able. More specifically, it consists of 5 steps:
> + 1. Pre-dump
> + 2. Freeze the container
> + 3. Dump
> + 4. Thaw (or kill) the container
> + 5. Post-dump
> +Steps 1 and 5 are an optimization to reduce application downtime:
> +"pre-dump" works before freezing the container, e.g. the pre-copy for
> +live migration, and "post-dump" works after the container resumes
> +execution, e.g. write-back the data to secondary storage.
> +
> +The restart code basically reads the saved kernel state from a
> +file descriptor, and re-creates the tasks and the resources they need
> +to resume execution. The restart code is executed by each task that
> +is restored in a new container to reconstruct its own state.
> +
> +
> +=== Interfaces
> +
> +int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
> + Checkpoint a container whose init task is identified by pid, to the
> + file designated by fd. Flags will have future meaning (should be 0
> + for now).
> + Returns: a positive integer that identifies the checkpoint image
> + (for future reference in case it is kept in memory) upon success,
> + 0 if it returns from a restart, and -1 if an error occurs.
> +
> +int sys_restart(int crid, int fd, unsigned long flags);
> + Restart a container from a checkpoint image identified by crid, or
> + from the blob stored in the file designated by fd. Flags will have
> + future meaning (should be 0 for now).
> + Returns: 0 on success and -1 if an error occurs.
> +
> +Thus, if checkpoint is initiated by a process in the container, one
> +can use logic similar to fork():
> + ...
> + crid = checkpoint(...);
> + switch (crid) {
> + case -1:
> + perror("checkpoint failed");
> + break;
> + default:
> + fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
> + /* proceed with execution after checkpoint */
> + ...
> + break;
> + case 0:
> + fprintf(stderr, "returned after restart\n");
> + /* proceed with action required following a restart */
> + ...
> + break;
> + }
> + ...
> +And to initiate a restart, the process in an empty container can use
> +logic similar to execve():
> + ...
> + if (restart(crid, ...) < 0)
> + perror("restart failed");
> + /* only get here if restart failed */
> + ...
> +
> +
> +=== Checkpoint image format
> +
> +The checkpoint image format is composed of records consisting of a
> +pre-header that identifies its contents, followed by a payload. (The
> +idea here is to enable parallel checkpointing in the future in which
> +multiple threads interleave data from multiple processes into a single
> +stream).
> +
> +The pre-header is defined by "struct cr_hdr" as follows:
> +
> +struct cr_hdr {
> + __s16 type;
> + __s16 len;
> + __u32 id;
> +};
> +
> +Here, the 'type' field identifies the type of the payload and 'len'
> +gives its length in bytes. The 'id' identifies the owner object
> +instance; its meaning varies depending on the type. For example,
> +for type CR_HDR_MM, the 'id' identifies the task to which this MM
> +belongs. The payload also varies depending on the type, for instance,
> +the data describing a task_struct is given by a 'struct cr_hdr_task'
> +(type CR_HDR_TASK) and so on.
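
To make the record framing concrete, here is a minimal userspace sketch
of walking such a stream. It is not code from the patch: the helper
names cr_put_record() and cr_count_records() are hypothetical, the
fixed-width types stand in for __s16/__u32, and it assumes each payload
follows its header directly with no extra padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the quoted 'struct cr_hdr'. */
struct cr_hdr {
	int16_t type;
	int16_t len;	/* payload length in bytes */
	uint32_t id;
};

/* Hypothetical helper: append one record (header + payload) at *off. */
void cr_put_record(unsigned char *buf, size_t *off, int16_t type,
		   uint32_t id, const void *payload, int16_t len)
{
	struct cr_hdr h = { type, len, id };

	assert(buf != NULL && off != NULL && len >= 0);
	memcpy(buf + *off, &h, sizeof(h));
	*off += sizeof(h);
	if (len > 0) {
		memcpy(buf + *off, payload, (size_t)len);
		*off += (size_t)len;
	}
}

/*
 * Hypothetical helper: count the records in buf[0..size).
 * Returns -1 if a header or payload is truncated or 'len' is negative.
 */
int cr_count_records(const unsigned char *buf, size_t size)
{
	size_t off = 0;
	int count = 0;

	assert(buf != NULL);
	while (off < size) {
		struct cr_hdr h;

		if (size - off < sizeof(h))
			return -1;
		memcpy(&h, buf + off, sizeof(h));	/* avoid unaligned reads */
		off += sizeof(h);
		if (h.len < 0 || (size_t)h.len > size - off)
			return -1;
		off += (size_t)h.len;			/* skip the payload */
		count++;
	}
	return count;
}
```

A reader that understands a given 'type' would decode the payload
instead of skipping it. Because 'len' is in every header, unknown
record types can still be stepped over.
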
> +
> +The format of the memory dump is as follows: for each VMA, there is a
> +'struct cr_hdr_vma'; if the VMA is file-mapped, it is followed by the file
> +name. Then come the actual contents, in one or more chunks: each
> +chunk begins with a header that specifies how many pages it holds,
> +then the virtual addresses of all the dumped pages in that chunk,
> +followed by the actual contents of all the dumped pages. A header with
> +zero number of pages marks the end of the contents for a particular
> +VMA. Then comes the next VMA and so on.
> +
> +To illustrate this, consider a single simple task with two VMAs: one
> +is file mapped with two dumped pages, and the other is anonymous with
> +three dumped pages. The checkpoint image will look like this:
> +
> +cr_hdr + cr_hdr_head
> +cr_hdr + cr_hdr_task
> + cr_hdr + cr_hdr_mm
> + cr_hdr + cr_hdr_vma + cr_hdr + string
> + cr_hdr_pgarr (nr_pages = 2)
> + addr1, addr2
> + page1, page2
> + cr_hdr_pgarr (nr_pages = 0)
> + cr_hdr + cr_hdr_vma
> + cr_hdr_pgarr (nr_pages = 3)
> + addr3, addr4, addr5
> + page3, page4, page5
> + cr_hdr_pgarr (nr_pages = 0)
> + cr_hdr + cr_mm_context
> + cr_hdr + cr_hdr_thread
> + cr_hdr + cr_hdr_cpu
> +cr_hdr + cr_hdr_tail
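
The chunked layout of one VMA's contents can be sketched the same way.
The quoted text does not give the layout of cr_hdr_pgarr, so
'struct pgarr_hdr' below is a hypothetical stand-in that models only the
page count. The 4096-byte page size and 64-bit addresses are also
assumptions, and vma_walk_chunks() is an illustrative name, not a
function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096u	/* assumed page size */

/* Hypothetical stand-in for cr_hdr_pgarr: only nr_pages is modeled. */
struct pgarr_hdr {
	uint64_t nr_pages;
};

/*
 * Walk the chunks of one VMA starting at buf. Each chunk is assumed to be
 *   pgarr_hdr, nr_pages x uint64_t vaddr, nr_pages x PAGE_SZ bytes
 * and a header with nr_pages == 0 ends the VMA. On success, stores the
 * total page count in *total and returns the number of bytes consumed,
 * terminator included. Returns 0 if the input is truncated.
 */
size_t vma_walk_chunks(const unsigned char *buf, size_t size, uint64_t *total)
{
	const size_t per_page = sizeof(uint64_t) + PAGE_SZ;
	size_t off = 0;

	assert(buf != NULL && total != NULL);
	*total = 0;
	for (;;) {
		struct pgarr_hdr h;

		if (size - off < sizeof(h))
			return 0;
		memcpy(&h, buf + off, sizeof(h));
		off += sizeof(h);
		if (h.nr_pages == 0)
			return off;	/* end-of-VMA marker */
		/* Divide rather than multiply so the check cannot overflow. */
		if (h.nr_pages > (size - off) / per_page)
			return 0;
		off += (size_t)h.nr_pages * per_page;	/* addresses + pages */
		*total += h.nr_pages;
	}
}
```

For the anonymous VMA in the example above, this would see one chunk
with nr_pages = 3 followed by the nr_pages = 0 terminator.
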
> +
> +
> +=== Changelog
> +
> +[2008-Sep-11] v5:
> + - Config is 'def_bool n' by default
> + - Improve memory dump/restore code (following Dave Hansen's comments)
> + - Change dump format (and code) to allow chunks of <vaddrs, pages>
> + instead of one long list of each
> + - Fix use of follow_page() to avoid faulting in non-present pages
> + - Memory restore now maps user pages explicitly to copy data into them,
> + instead of reading directly to user space; got rid of mprotect_fixup()
> + - Remove preempt_disable() when restoring debug registers
> + - Rename header files s/ckpt/checkpoint/
> + - Fix misc bugs in files dump/restore
> + - Fix cleanup on some error paths
> + - Fix misc coding style
> +
> +[2008-Sep-04] v4:
> + - Fix calculation of hash table size
> + - Fix header structure alignment
> + - Use standard list_... for cr_pgarr
> +
> +[2008-Aug-20] v3:
> + - Various fixes and clean-ups
> + - Use standard hlist_... for hash table
> + - Better use of standard kmalloc/kfree
> +
> +[2008-Aug-09] v2:
> + - Added utsname->{release,version,machine} to checkpoint header
> + - Pad header structures to 64 bits to ensure compatibility
> + - Address comments from LKML and linux-containers mailing list
> +
> +[2008-Jul-29] v1:
> +In this incarnation, CR only works on a single task. The address space
> +may consist of only private, simple VMAs - anonymous or file-mapped.
> +Both checkpoint and restart will ignore the first argument (pid/crid)
> +and instead act on themselves.
> --
> 1.5.4.3