* [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
@ 2009-01-27 17:07 Oren Laadan
  2009-01-27 17:07 ` [RFC v13][PATCH 01/14] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
                   ` (12 more replies)
  0 siblings, 13 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
Checkpoint-restart (c/r): a couple of fixes in preparation for 64bit
architectures, and a couple of bug fixes (comments from Serge
Hallyn, Sukadev Bhattiprolu and Nathan Lynch). Updated and tested
against v2.6.28.
Aiming for -mm.
The git tree tracking v13, branch 'ckpt-v13' (and older versions):
	git://git.ncl.cs.columbia.edu/pub/git/linux-cr.git
Restarting multiple processes requires 'mktree' userspace tool:
	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git
Oren.
--
Why do we want it?  It allows containers to be moved between physical
machines' kernels in the same way that VMWare can move VMs between
physical machines' hypervisors.  There are currently at least two
out-of-tree implementations of this in the commercial world (IBM's
Metacluster and Parallels' OpenVZ/Virtuozzo) and several in the academic
world like Zap.
Why do we need it in mainline now?  Because we already have plenty of
out-of-tree ones, and we want to know what an in-tree one will be like.  :)
What *I* want right now is the extra review and scrutiny that comes with
a mainline submission to make sure we're not going in a direction
contrary to the community.
This only supports pretty simple apps.  But, I trust Ingo when he says:
>> > > Generally, if something works for simple apps already (in a robust, 
>> > > compatible and supportable way) and users find it "very cool", then 
>> > > support for more complex apps is not far in the future.  but if you
>> > > want to support more complex apps straight away, it takes forever and
>> > > gets ugly.
We're *certainly* going to be changing the ABI (which is the format of
the checkpoint).  I'd like to follow the model that we used for
ext4-dev, which is to make it very clear that this is a development-only
feature for now.  Perhaps we do that by making the interface only
available through debugfs or something similar for now.  Or, reserving
the syscall numbers but require some runtime switch to be thrown before
they can be used.  I'm open to suggestions here.
--
--
Todo:
- Add support for x86-64 and improve ABI
- Refine or change syscall interface
- Handle multiple namespaces in a container (e.g. save the filesystem
  namespaces state with the file descriptors)
- Security (without CAP_SYS_ADMIN file restore may fail)
Changelog:
[2008-Dec-17] v12:
  - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
    (empty pgarr are saved in a separate pool chain)
  - Add a couple of missed calls to cr_hbuf_put()
  - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
  - Split cr_write/cr_read() to two parts: _cr_write/read() helper
  - Befriend with sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()
[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't frozen,
    life span of the object hash, references to objects in the hash
 
[2008-Nov-26] v10:
  - Grab vfs root of container init, rather than current process
  - Acquire dcache_lock around call to __d_path() in cr_fill_name()
  - Force end-of-string in cr_read_string() (fix possible DoS)
  - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
[2008-Nov-10] v9:
  - Support multiple processes c/r
  - Extend checkpoint header with architecture dependent header
  - Misc bug fixes (see individual changelogs)
  - Rebase to v2.6.28-rc3.
[2008-Oct-29] v8:
  - Support "external" checkpoint
  - Include Dave Hansen's 'deny-checkpoint' patch
  - Split docs in Documentation/checkpoint/..., and improve contents
[2008-Oct-17] v7:
  - Fix save/restore state of FPU
  - Fix argument given to kunmap_atomic() in memory dump/restore
[2008-Oct-07] v6:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
  - Add assumptions and what's-missing to documentation
  - Misc fixes and cleanups
[2008-Sep-11] v5:
  - Config is now 'def_bool n' by default
  - Improve memory dump/restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
  - Remove preempt_disable() when restoring debug registers
  - Rename headers files s/ckpt/checkpoint/
  - Fix misc bugs in files dump/restore
  - Fixes and cleanups on some error paths
  - Fix misc coding style
[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use standard list_... for cr_pgarr
[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree
[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAP_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1
[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.
--
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.
This is different than what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing.  This unites those into a single, grand interface.
The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs.  It will also contain
copies of the actual memory that the process uses.  Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.
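To give a feel for what such a userspace tool would start from, here
is a minimal sketch of a record walker. It uses the 'struct cr_hdr'
pre-header documented in internals.txt (patch 02/14), mirrored with
fixed-width userspace types. The helper name cr_walk_records() and
the visitor callback are hypothetical and not part of this series.
The sketch assumes a buffer made up only of cr_hdr-framed records in
host byte order; the memory-dump chunks (cr_hdr_pgarr, addresses and
raw pages) are laid out differently and are not handled here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the 'struct cr_hdr' pre-header (internals.txt). */
struct cr_hdr {
	int16_t type;		/* type of the payload */
	int16_t len;		/* payload length in bytes */
	uint32_t parent;	/* owner object instance */
};

/* Hypothetical visitor type: called once per record. */
typedef void (*cr_visit_fn)(const struct cr_hdr *h,
			    const unsigned char *payload);

/*
 * Hypothetical helper: walk a buffer of <cr_hdr, payload> records.
 * Returns the number of records seen, or -1 if a header or payload
 * is truncated or a length is negative.
 */
static int cr_walk_records(const unsigned char *buf, size_t size,
			   cr_visit_fn visit)
{
	size_t off = 0;
	int n = 0;

	while (off < size) {
		struct cr_hdr h;

		if (size - off < sizeof(h))
			return -1;	/* truncated pre-header */
		memcpy(&h, buf + off, sizeof(h));
		off += sizeof(h);

		if (h.len < 0 || (size_t)h.len > size - off)
			return -1;	/* bad or truncated payload */
		if (visit)
			visit(&h, buf + off);

		off += (size_t)h.len;
		n++;
	}
	return n;
}
```

A converter would replace the visitor with code that rewrites each
payload according to its 'type'; passing NULL just validates the
framing.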
This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.
These patches basically serialize internal kernel state and write
it out to a file descriptor.  The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.
In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped. The open files may consist
of only simple files and directories.
--
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [RFC v13][PATCH 01/14] Create syscalls: sys_checkpoint, sys_restart
  2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
@ 2009-01-27 17:07 ` Oren Laadan
  2009-01-27 17:20   ` Randy Dunlap
  2009-01-27 17:08 ` [RFC v13][PATCH 02/14] Checkpoint/restart: initial documentation Oren Laadan
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.
The syscalls take a file descriptor (for the image file) and flags as
arguments. For sys_checkpoint the first argument identifies the target
container; for sys_restart it will identify the checkpoint image.
A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.
By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.
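A minimal sketch of how a caller might tell the three cases apart,
based on the return convention in the sys_checkpoint() kernel-doc
below (positive identifier on success, 0 when returning from restart,
negative on error). The enum and the helper ckpt_classify() are
hypothetical names used only for illustration.

```c
/* Hypothetical outcome labels for the value returned by checkpoint. */
enum ckpt_outcome {
	CKPT_ERROR,	/* negative: the call failed */
	CKPT_RESTARTED,	/* zero: we are resuming out of a restart */
	CKPT_TAKEN	/* positive: checkpoint taken, value is its id */
};

/* Map the raw return value to one of the three cases, fork()-style. */
static enum ckpt_outcome ckpt_classify(long ret)
{
	if (ret < 0)
		return CKPT_ERROR;
	if (ret == 0)
		return CKPT_RESTARTED;
	return CKPT_TAKEN;
}
```

A self-checkpointing task would call
syscall(__NR_checkpoint, pid, fd, 0) and switch on
ckpt_classify(ret), much as it would branch on the result of fork().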
We don't use copy_from_user()/copy_to_user() because it requires
holding the entire image in user space, and does not make sense for
restart.  Also, we don't use a pipe, pseudo-fs file and the like,
because they work by generating data on demand as the user pulls it
(unless the entire image is buffered in the kernel) and would require
more complex logic.  They also would significantly complicate
checkpoint that includes self.
Changelog[v5]:
  - Config is 'def_bool n' by default
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/unistd_32.h   |    2 +
 arch/x86/kernel/syscall_table_32.S |    2 +
 checkpoint/Kconfig                 |   11 +++++++++
 checkpoint/Makefile                |    5 ++++
 checkpoint/sys.c                   |   41 ++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h           |    2 +
 init/Kconfig                       |    2 +
 kernel/sys_ni.c                    |    4 +++
 8 files changed, 69 insertions(+), 0 deletions(-)
 create mode 100644 checkpoint/Kconfig
 create mode 100644 checkpoint/Makefile
 create mode 100644 checkpoint/sys.c
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..a5f9e09 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,8 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_checkpoint		333
+#define __NR_restart		334
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..5543136 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/checkpoint/Kconfig b/checkpoint/Kconfig
new file mode 100644
index 0000000..ffaa635
--- /dev/null
+++ b/checkpoint/Kconfig
@@ -0,0 +1,11 @@
+config CHECKPOINT_RESTART
+	prompt "Enable checkpoint/restart (EXPERIMENTAL)"
+	def_bool n
+	depends on X86_32 && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
new file mode 100644
index 0000000..07d018b
--- /dev/null
+++ b/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
new file mode 100644
index 0000000..375129c
--- /dev/null
+++ b/checkpoint/sys.c
@@ -0,0 +1,41 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which to dump the checkpoint image
+ * @flags: checkpoint operation flags
+ *
+ * Returns positive identifier on success, 0 when returning from restart
+ * or negative value on error
+ */
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
+{
+	pr_debug("sys_checkpoint not implemented yet\n");
+	return -ENOSYS;
+}
+/**
+ * sys_restart - restart a container
+ * @crid: checkpoint image identifier
+ * @fd: file from which to read the checkpoint image
+ * @flags: restart operation flags
+ *
+ * Returns negative value on error, or otherwise returns in the realm
+ * of the original checkpoint
+ */
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
+{
+	pr_debug("sys_restart not implemented yet\n");
+	return -ENOSYS;
+}
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 04fb47b..9750393 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -621,6 +621,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
diff --git a/init/Kconfig b/init/Kconfig
index f763762..57364fe 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -814,6 +814,8 @@ config MARKERS
 
 source "arch/Kconfig"
 
+source "checkpoint/Kconfig"
+
 endmenu		# General setup
 
 config HAVE_GENERIC_DMA_COHERENT
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index e14a232..fcd65cc 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,3 +174,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
-- 
1.5.4.3
* [RFC v13][PATCH 02/14] Checkpoint/restart: initial documentation
  2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
  2009-01-27 17:07 ` [RFC v13][PATCH 01/14] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2009-01-27 17:08 ` Oren Laadan
  2009-01-27 17:08 ` [RFC v13][PATCH 03/14] Make file_pos_read/write() public Oren Laadan
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
Covers application checkpoint/restart, overall design, interfaces,
usage, shared objects, and checkpoint image format.
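As a rough illustration of the shared-object handling that
internals.txt describes for the checkpoint side (first encounter:
dump the full state and assign a unique identifier; later encounters:
save only the identifier), here is a toy userspace sketch. It is not
the kernel's objhash: the names sketch_objhash and
sketch_obj_add_ptr(), the fixed table size and the linear probing are
all assumptions made for the example, and it takes no references on
the objects it tracks.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HASH_SIZE 64	/* power of two, arbitrary for the sketch */

/* Toy table keyed by object address; must start zero-initialized. */
struct sketch_objhash {
	const void *ptr[SKETCH_HASH_SIZE];
	int objref[SKETCH_HASH_SIZE];
	int next_ref;		/* last identifier handed out */
};

static size_t sketch_slot(const void *ptr)
{
	return ((uintptr_t)ptr >> 4) & (SKETCH_HASH_SIZE - 1);
}

/*
 * Look up a non-NULL ptr, adding it if it is not there yet, and fill
 * *objref with its identifier.
 * Returns 1 if the object is new (caller dumps its full state),
 * 0 if it was seen before (caller saves only *objref),
 * -1 if the table is full.
 */
static int sketch_obj_add_ptr(struct sketch_objhash *h, const void *ptr,
			      int *objref)
{
	size_t i, start = sketch_slot(ptr);

	for (i = 0; i < SKETCH_HASH_SIZE; i++) {
		size_t k = (start + i) & (SKETCH_HASH_SIZE - 1);

		if (h->ptr[k] == ptr) {
			*objref = h->objref[k];
			return 0;
		}
		if (h->ptr[k] == NULL) {
			h->ptr[k] = ptr;
			h->objref[k] = ++h->next_ref;
			*objref = h->objref[k];
			return 1;
		}
	}
	return -1;
}
```

Entries are never removed, which matches the "one-way" behavior the
document describes: the whole table is simply discarded when the
operation ends.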
Changelog[v8]:
  - Split into multiple files in Documentation/checkpoint/...
  - Extend documentation, fix typos and comments from feedback
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 Documentation/checkpoint/ckpt.c        |   32 ++++++
 Documentation/checkpoint/internals.txt |  133 +++++++++++++++++++++++++
 Documentation/checkpoint/readme.txt    |  105 +++++++++++++++++++
 Documentation/checkpoint/rstr.c        |   20 ++++
 Documentation/checkpoint/security.txt  |   38 +++++++
 Documentation/checkpoint/self.c        |   57 +++++++++++
 Documentation/checkpoint/test.c        |   48 +++++++++
 Documentation/checkpoint/usage.txt     |  171 ++++++++++++++++++++++++++++++++
 8 files changed, 604 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/checkpoint/ckpt.c
 create mode 100644 Documentation/checkpoint/internals.txt
 create mode 100644 Documentation/checkpoint/readme.txt
 create mode 100644 Documentation/checkpoint/rstr.c
 create mode 100644 Documentation/checkpoint/security.txt
 create mode 100644 Documentation/checkpoint/self.c
 create mode 100644 Documentation/checkpoint/test.c
 create mode 100644 Documentation/checkpoint/usage.txt
diff --git a/Documentation/checkpoint/ckpt.c b/Documentation/checkpoint/ckpt.c
new file mode 100644
index 0000000..094408c
--- /dev/null
+++ b/Documentation/checkpoint/ckpt.c
@@ -0,0 +1,32 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid;
+	int ret;
+
+	if (argc != 2) {
+		printf("usage: ckpt PID\n");
+		exit(1);
+	}
+
+	pid = atoi(argv[1]);
+	if (pid <= 0) {
+		printf("invalid pid\n");
+		exit(1);
+	}
+
+	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+
+	if (ret < 0)
+		perror("checkpoint");
+	else
+		printf("checkpoint id %d\n", ret);
+
+	return (ret > 0 ? 0 : 1);
+}
+
diff --git a/Documentation/checkpoint/internals.txt b/Documentation/checkpoint/internals.txt
new file mode 100644
index 0000000..b363e83
--- /dev/null
+++ b/Documentation/checkpoint/internals.txt
@@ -0,0 +1,133 @@
+
+	===== Internals of Checkpoint-Restart =====
+
+
+(1) Order of state dump
+
+The order of operations, both save and restore, is as follows:
+
+* Header section: header, container information, etc.
+
+* Global section: [TBD] global resources such as IPC, UTS, etc.
+
+* Process forest: [TBD] tasks and their relationships
+
+* Per task data (for each task):
+  -> task state: elements of task_struct
+  -> thread state: elements of thread_struct and thread_info
+  -> CPU state: registers etc, including FPU
+  -> memory state: memory address space layout and contents
+  -> filesystem state: [TBD] filesystem namespace state, chroot, cwd, etc
+  -> files state: open file descriptors and their state
+  -> signals state: [TBD] pending signals and signal handling state
+  -> credentials state: [TBD] user and group state, statistics
+
+
+(2) Checkpoint image format
+
+The checkpoint image format is composed of records consisting of a
+pre-header that identifies its contents, followed by a payload. (The
+idea here is to enable parallel checkpointing in the future in which
+multiple threads interleave data from multiple processes into a single
+stream).
+
+The pre-header is defined by "struct cr_hdr" as follows:
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 parent;
+};
+
+'type' identifies the type of the payload, 'len' tells its length in
+bytes, and 'parent' identifies the owner object instance. The meaning
+of 'parent' varies depending on the type. For example, for CR_HDR_MM,
+'parent' identifies the task to which this MM belongs. The payload
+also varies depending on the type, for instance, the data describing a
+task_struct is given by a 'struct cr_hdr_task' (type CR_HDR_TASK) and
+so on.
+
+The format of the memory dump is as follows: for each VMA, there is a
+'struct cr_vma'; if the VMA is file-mapped, it is followed by the file
+name. Following comes the actual contents, in one or more chunks: each
+chunk begins with a header that specifies how many pages it holds,
+then the virtual addresses of all the dumped pages in that chunk,
+followed by the actual contents of all the dumped pages. A header with
+zero number of pages marks the end of the contents for a particular
+VMA. Then comes the next VMA and so on.
+
+To illustrate this, consider a single simple task with two VMAs: one
+is file mapped with two dumped pages, and the other is anonymous with
+three dumped pages. The checkpoint image will look like this:
+
+cr_hdr + cr_hdr_head
+cr_hdr + cr_hdr_task
+	cr_hdr + cr_hdr_mm
+		cr_hdr + cr_hdr_vma + cr_hdr + string
+			cr_hdr_pgarr (nr_pages = 2)
+			addr1, addr2
+			page1, page2
+			cr_hdr_pgarr (nr_pages = 0)
+		cr_hdr + cr_hdr_vma
+			cr_hdr_pgarr (nr_pages = 3)
+			addr3, addr4, addr5
+			page3, page4, page5
+			cr_hdr_pgarr (nr_pages = 0)
+		cr_hdr + cr_mm_context
+	cr_hdr + cr_hdr_thread
+	cr_hdr + cr_hdr_cpu
+cr_hdr + cr_hdr_tail
+
+
+(3) Shared resources (objects)
+
+Many resources used by tasks may be shared by more than one task (e.g.
+file descriptors, memory address space, etc), or even have multiple
+references from other resources (e.g. a single inode that represents
+two ends of a pipe).
+
+Clearly, the state of shared objects need only be saved once, even if
+they occur multiple times. We use a hash table (ctx->objhash) to keep
+track of shared objects and whether they were already saved.  Shared
+objects are stored in a hash table as they appear, indexed by their
+kernel address. (The hash table itself is not saved as part of the
+checkpoint image: it is constructed dynamically during both checkpoint
+and restart, and discarded at the end of the operation).
+
+Each shared object that is found is first looked up in the hash table.
+On the first encounter, the object will not be found, so its state is
+dumped, and the object is assigned a unique identifier and also stored
+in the hash table. Subsequent lookups of that object in the hash table
+will yield that entry, and then only the unique identifier is saved,
+as opposed to the entire state of the object.
+
+During restart, shared objects are seen by their unique identifiers as
+assigned during the checkpoint. Each shared object that is read in is
+first looked up in the hash table. On the first encounter it will not
+be found, meaning that the object needs to be created and its state
+read in and restored. Then the object is added to the hash table, this
+time indexed by its unique identifier. Subsequent lookups of the same
+unique identifier in the hash table will yield that entry, and then
+the existing object instance is reused instead of creating another one.
+
+The hash grabs a reference to each object that is inserted, and
+maintains this reference for the entire lifetime of the hash. Thus,
+it is always safe to reference an object that is stored in the hash.
+The hash is "one-way" in the sense that objects that are added are
+never deleted from the hash until the hash is discarded. This, in
+turn, happens only when the checkpoint (or restart) terminates.
+
+The interface for the hash table is the following:
+
+cr_obj_get_by_ptr() - find the unique object reference (objref)
+  of the object that is pointed to by ptr [checkpoint]
+
+cr_obj_add_ptr() - add the object pointed to by ptr to the hash table
+  if not already there, and fill its unique object reference (objref)
+
+cr_obj_get_by_ref() - return the pointer to the object whose unique
+  object reference is equal to objref [restart]
+
+cr_obj_add_ref() - add the object with given unique object reference
+  (objref), pointed to by ptr to the hash table. [restart]
+
diff --git a/Documentation/checkpoint/readme.txt b/Documentation/checkpoint/readme.txt
new file mode 100644
index 0000000..344a551
--- /dev/null
+++ b/Documentation/checkpoint/readme.txt
@@ -0,0 +1,105 @@
+
+	===== Checkpoint-Restart support in the Linux kernel =====
+
+Copyright (C) 2008 Oren Laadan
+
+Author:		Oren Laadan <orenl@cs.columbia.edu>
+
+License:	The GNU Free Documentation License, Version 1.2
+		(dual licensed under the GPL v2)
+
+Reviewers:	Serge Hallyn <serue@us.ibm.com>
+		Dave Hansen <dave@linux.vnet.ibm.com>
+
+Application checkpoint/restart [C/R] is the ability to save the state
+of a running application so that it can later resume its execution
+from the time at which it was checkpointed. An application can be
+migrated by checkpointing it on one machine and restarting it on
+another. C/R can provide many potential benefits:
+
+* Failure recovery: by rolling back to a previous checkpoint
+
+* Improved response time: by restarting applications from checkpoints
+  instead of from scratch.
+
+* Improved system utilization: by suspending long running CPU
+  intensive jobs and resuming them when load decreases.
+
+* Fault resilience: by migrating applications off faulty hosts.
+
+* Dynamic load balancing: by migrating applications to less loaded
+  hosts.
+
+* Improved service availability and administration: by migrating
+  applications before host maintenance so that they continue to run
+  with minimal downtime
+
+* Time-travel: by taking periodic checkpoints and restarting from
+  any previous checkpoint.
+
+
+=== Overall design
+
+Checkpoint and restart is done in the kernel as much as possible. The
+kernel exports a relatively opaque 'blob' of data to userspace which can
+then be handed to the new kernel at restore time.  The 'blob' contains
+data and state of select portions of kernel structures such as VMAs
+and mm_structs, as well as copies of the actual memory that the tasks
+use. Any changes in this blob's format between kernel revisions can be
+handled by an in-userspace conversion program. The approach is similar
+to virtually all of the commercial C/R products out there, as well as
+the research project Zap.
+
+Two new system calls are introduced to provide C/R: sys_checkpoint()
+and sys_restart(). The checkpoint code basically serializes internal
+kernel state and writes it out to a file descriptor, and the resulting
+image is stream-able. More specifically, it consists of 5 steps:
+
+1. Pre-dump
+2. Freeze the container
+3. Dump
+4. Thaw (or kill) the container
+5. Post-dump
+
+Steps 1 and 5 are an optimization to reduce application downtime. In
+particular, "pre-dump" works before freezing the container, e.g. the
+pre-copy for live migration, and "post-dump" works after the container
+resumes execution, e.g. write-back the data to secondary storage.
+
+The restart code basically reads the saved kernel state from a file
+descriptor, and re-creates the tasks and the resources they need to
+resume execution. The restart code is executed by each task that is
+restored in a new container to reconstruct its own state.
+
+
+=== Current Implementation
+
+* How useful is this code as it stands in real-world usage?
+
+Right now, the application must be a single process that does not
+share any resources with other processes. The only file descriptors
+that may be open are simple files and directories, they may not
+include devices, sockets or pipes.
+
+For an "external" checkpoint, the caller must first freeze (or stop)
+the target process. For "self" checkpoint, the application must be
+specifically written to use the new system calls. The restart does not
+yet preserve the pid of the original process, but will use whatever
+pid it was given by the kernel.
+
+What this means in practice is that it is useful for a simple
+application doing computational work and input/output from/to files.
+
+Currently, namespaces are not saved or restored. They will be treated
+as a class of a shared object. In particular, it is assumed that the
+task's file system namespace is the "root" for the entire container.
+It is also assumed that the same file system view is available for the
+restart task(s). Otherwise, a file system snapshot is required.
+
+* What additional work needs to be done to it?
+
+We know this design can work.  We have two commercial products and a
+horde of academic projects doing it today using this basic design.
+We're early in this particular implementation because we're trying to
+release early and often.
+
diff --git a/Documentation/checkpoint/rstr.c b/Documentation/checkpoint/rstr.c
new file mode 100644
index 0000000..288209d
--- /dev/null
+++ b/Documentation/checkpoint/rstr.c
@@ -0,0 +1,20 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <sys/syscall.h>
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	int ret;
+
+	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
+	if (ret < 0)
+		perror("restart");
+
+	printf("should not reach here !\n");
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/security.txt b/Documentation/checkpoint/security.txt
new file mode 100644
index 0000000..e5b4107
--- /dev/null
+++ b/Documentation/checkpoint/security.txt
@@ -0,0 +1,38 @@
+
+	===== Security consideration for Checkpoint-Restart =====
+
+The main question is whether sys_checkpoint() and sys_restart()
+require privileged or unprivileged operation.
+
+Early versions checked capable(CAP_SYS_ADMIN) assuming that we would
+attempt to remove the need for privilege, so that all users could
+safely use it. Arnd Bergmann pointed out that it'd make more sense to
+let unprivileged users use them now, so that we'll be more careful
+about the security as patches roll in.
+
+Checkpoint: the main concern is whether a task that performs the
+checkpoint of another task has sufficient privileges to access its
+state. We address this by requiring that the checkpointer task will be
+able to ptrace the target task, by means of ptrace_may_access() with
+read mode.
+
+Restart: the main concern is that we may allow an unprivileged user to
+feed the kernel with random data. To this end, the restart works in a
+way that does not skip the usual security checks. Task credentials,
+i.e. euid, reuid, and LSM security contexts currently come from the
+caller, not the checkpoint image.  When restoration of credentials
+becomes supported, then definitely the ability of the task that calls
+sys_restart() to setresuid/setresgid to those values must be checked.
+
+Keeping the restart procedure to operate within the limits of the
+caller's credentials means that there are various scenarios that cannot
+be supported. For instance, a setuid program that opened a protected
+log file and then dropped privileges will fail the restart, because
+the user won't have enough credentials to reopen the file. In these
+cases, we should probably treat restarting like inserting a kernel
+module: surely the user can cause havoc by providing incorrect data,
+but then again we must trust the root account.
+
+So that's why we don't want CAP_SYS_ADMIN required up-front. That way
+we will be forced to more carefully review each of those features.
+
diff --git a/Documentation/checkpoint/self.c b/Documentation/checkpoint/self.c
new file mode 100644
index 0000000..febb888
--- /dev/null
+++ b/Documentation/checkpoint/self.c
@@ -0,0 +1,57 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <math.h>
+#include <sys/syscall.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	pid_t pid = getpid();
+	FILE *file;
+	int i, ret;
+	float a;
+
+	close(0);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+
+		if (i == 2) {
+			ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
+			if (ret < 0) {
+				fprintf(file, "ckpt: %s\n", strerror(errno));
+				exit(2);
+			}
+			fprintf(file, "checkpoint ret: %d\n", ret);
+			fflush(file);
+		}
+	}
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/test.c b/Documentation/checkpoint/test.c
new file mode 100644
index 0000000..1183655
--- /dev/null
+++ b/Documentation/checkpoint/test.c
@@ -0,0 +1,48 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <math.h>
+
+#define OUTFILE  "/tmp/cr-test.out"
+
+int main(int argc, char *argv[])
+{
+	FILE *file;
+	float a;
+	int i;
+
+	close(0);
+	close(1);
+	close(2);
+
+	unlink(OUTFILE);
+	file = fopen(OUTFILE, "w+");
+	if (!file) {
+		perror("open");
+		exit(1);
+	}
+	if (dup2(0, 2) < 0) {
+		perror("dup2");
+		exit(1);
+	}
+
+	a = sqrt(2.53 * (getpid() / 1.21));
+
+	fprintf(file, "hello, world (%.2f)!\n", a);
+	fflush(file);
+
+	for (i = 0; i < 1000; i++) {
+		sleep(1);
+		/* make the fpu work ->  a = a + i/10  */
+		a = sqrt(a*a + 2*a*(i/10.0) + i*i/100.0);
+		fprintf(file, "count %d (%.2f)!\n", i, a);
+		fflush(file);
+	}
+
+	fprintf(file, "world, hello (%.2f) !\n", a);
+	fflush(file);
+
+	return 0;
+}
+
diff --git a/Documentation/checkpoint/usage.txt b/Documentation/checkpoint/usage.txt
new file mode 100644
index 0000000..1b42d6b
--- /dev/null
+++ b/Documentation/checkpoint/usage.txt
@@ -0,0 +1,171 @@
+
+	===== How to use Checkpoint-Restart =====
+
+The API consists of two new system calls:
+
+* int sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+
+    Checkpoint a container whose init task is identified by pid, to
+    the file designated by fd. 'flags' will have future meaning (must
+    be 0 for now).
+
+    Returns: a positive checkpoint identifier (crid) upon success, 0
+    if it returns from a restart, and -1 if an error occurs.
+
+    'crid' uniquely identifies a checkpoint image. For each checkpoint
+    the kernel allocates a unique 'crid', that remains valid for as
+    long as the checkpoint is kept in the kernel (for instance, when a
+    checkpoint, or a partial checkpoint, may reside in kernel memory).
+
+* int sys_restart(int crid, int fd, unsigned long flags);
+
+    Restart a container from a checkpoint image that is read from the
+    blob stored in the file designated by fd. 'crid' will have future
+    meaning (must be 0 for now). 'flags' will have future meaning
+    (must be 0 for now).
+
+    The role of 'crid' is to identify the checkpoint image in the case
+    that it remains in kernel memory. This will be useful to restart
+    from a checkpoint image that remains in kernel memory.
+
+    Returns: -1 if an error occurs, 0 on success when restarting from
+    a "self" checkpoint, and return value of system call at the time
+    of the checkpoint when restarting from an "external" checkpoint.
+
+    If restarting from an "external" checkpoint, tasks that were
+    executing a system call will observe the return value of that
+    system call (as it was when interrupted for the act of taking the
+    checkpoint), and tasks that were executing in user space will be
+    ready to return there.
+
+    Upon successful "external" restart, the container will end up in a
+    frozen state.
+
+The granularity of a checkpoint usually is a whole container. The
+'pid' argument is interpreted in the caller's pid namespace. So to
+checkpoint a container whose init task (pid 1 in that pidns) appears
+as pid 3497 in the caller's pidns, the caller must use pid 3497. Passing
+pid 1 will attempt to checkpoint the caller's container, and if the
+caller isn't privileged and init is owned by root, it will fail.
+
+If the caller passes a pid which does not refer to a container's init
+task, then sys_checkpoint() would return -EINVAL. (This is because
+with nested containers a task may belong to more than one container).
+
+We assume that during checkpoint and restart the container state is
+quiescent. During checkpoint, this means that all affected tasks are
+frozen (or otherwise stopped). During restart, this means that all
+affected tasks are executing the sys_restart() call. In both cases,
+if there are other tasks possibly sharing state with the container,
+they must not modify it during the operation. It is the responsibility
+of the caller to follow this requirement.
+
+If the assumption that all tasks are frozen and that there is no other
+sharing doesn't hold - then the results of the operation are undefined
+(just as, e.g. not calling execve() immediately after vfork() produces
+undefined results). In particular, either checkpoint will fail, or it
+may produce a checkpoint image that can't be restarted, or (unlikely)
+the restart may produce a container whose state does not match that of
+the original container.
+
+
+Here is a code snippet that illustrates how a checkpoint is initiated
+by a process in a container - the logic is similar to fork():
+	...
+	crid = checkpoint(1, ...);
+	switch (crid) {
+	case -1:
+		perror("checkpoint failed");
+		break;
+	default:
+		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
+		/* proceed with execution after checkpoint */
+		...
+		break;
+	case 0:
+		fprintf(stderr, "returned after restart\n");
+		/* proceed with action required following a restart */
+		...
+		break;
+	}
+	...
+
+And to initiate a restart, the process in an empty container can use
+logic similar to execve():
+	...
+	if (restart(crid, ...) < 0)
+		perror("restart failed");
+	/* only get here if restart failed */
+	...
+
+Note that the code also supports "self" checkpoint, where a process
+can checkpoint itself. This mode does not capture the relationships
+of the task with other tasks, or any shared resources. It is useful
+for applications that wish to be able to save and restore their state.
+They will either not use (or care about) shared resources, or they
+will be aware of the operations and adapt suitably after a restart.
+The code above can also be used for "self" checkpoint.
+
+To illustrate how the API works, refer to these sample programs:
+
+* ckpt.c: accepts a 'pid' argument and checkpoints that task to stdout
+* rstr.c: restarts a checkpoint image from stdin
+* self.c: a simple test program doing self-checkpoint
+* test.c: a simple test program to checkpoint
+
+"External" checkpoint:
+---------------------
+To do "external" checkpoint, you need to first freeze that other task
+either using the freezer cgroup, or by sending SIGSTOP.
+
+Restart does not preserve the original PID yet (because we haven't
+yet solved the fork-with-specific-pid issue). In a real scenario, you
+probably want to first create a new namespace, and have the init
+task there call 'sys_restart()'.
+
+I tested it this way:
+	$ ./test &
+	[1] 3493
+
+	$ kill -STOP 3493
+	$ ./ckpt 3493 > ckpt.image
+
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ kill -CONT 3493
+
+	$ ./rstr < ckpt.image
+Now compare the output of the two output files.
+
+"Self" checkpoint:
+-----------------
+To do "self" checkpoint, you can incorporate the code from ckpt.c into
+your application.
+
+Here is how to test the "self" checkpoint:
+	$ ./self > self.image &
+	[1] 3512
+
+	$ sleep 3
+	$ mv /tmp/cr-test.out /tmp/cr-test.out.orig
+	$ cp /tmp/cr-test.out.orig /tmp/cr-test.out
+
+	$ cat /tmp/cr-test.out
+	hello, world (85.46)!
+	count 0 (85.46)!
+	count 1 (85.56)!
+	count 2 (85.76)!
+	count 3 (86.46)!
+
+	$ sed -i 's/count/xxxx/g' /tmp/cr-test.out
+
+	$ ./rstr < self.image &
+Now compare the output of the two output files.
+
+Note how in test.c we close stdin, stdout, stderr - that's because
+currently we only support regular files (not ttys/ptys).
+
+If you check the output of ps, you'll see that "rstr" changed its name
+to "test" or "self", as expected.
+
-- 
1.5.4.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
* [RFC v13][PATCH 03/14] Make file_pos_read/write() public
  2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
  2009-01-27 17:07 ` [RFC v13][PATCH 01/14] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
  2009-01-27 17:08 ` [RFC v13][PATCH 02/14] Checkpoint/restart: initial documentation Oren Laadan
@ 2009-01-27 17:08 ` Oren Laadan
  2009-01-27 17:08 ` [RFC v13][PATCH 04/14] General infrastructure for checkpoint restart Oren Laadan
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
These two are used in the next patch when calling vfs_read/write()
---
 fs/read_write.c    |   10 ----------
 include/linux/fs.h |   10 ++++++++++
 2 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index 969a6d9..dda4eab 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -346,16 +346,6 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
 
 EXPORT_SYMBOL(vfs_write);
 
-static inline loff_t file_pos_read(struct file *file)
-{
-	return file->f_pos;
-}
-
-static inline void file_pos_write(struct file *file, loff_t pos)
-{
-	file->f_pos = pos;
-}
-
 asmlinkage ssize_t sys_read(unsigned int fd, char __user * buf, size_t count)
 {
 	struct file *file;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4a853ef..39a54b9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1344,6 +1344,16 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 				struct iovec *fast_pointer,
 				struct iovec **ret_pointer);
 
+static inline loff_t file_pos_read(struct file *file)
+{
+	return file->f_pos;
+}
+
+static inline void file_pos_write(struct file *file, loff_t pos)
+{
+	file->f_pos = pos;
+}
+
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-- 
1.5.4.3
* [RFC v13][PATCH 04/14] General infrastructure for checkpoint restart
  2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
                   ` (2 preceding siblings ...)
  2009-01-27 17:08 ` [RFC v13][PATCH 03/14] Make file_pos_read/write() public Oren Laadan
@ 2009-01-27 17:08 ` Oren Laadan
  2009-01-27 17:08 ` [RFC v13][PATCH 05/14] x86 support for checkpoint/restart Oren Laadan
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
Add those interfaces, as well as helpers needed to easily manage the
file format. The code is roughly broken out as follows:
checkpoint/sys.c - user/kernel data transfer, as well as setup of the
  CR context (a per-checkpoint data structure for housekeeping)
checkpoint/checkpoint.c - output wrappers and basic checkpoint handling
checkpoint/restart.c - input wrappers and basic restart handling
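Part of the CR context is a temporary header buffer that the helpers
use in stack (LIFO) order through cr_hbuf_get()/cr_hbuf_put() in
checkpoint/sys.c. The following is only a rough userspace sketch of
that discipline, not code from this patch: the names hbuf_ctx,
hbuf_get, hbuf_put and hbuf_nested_demo are made up for illustration,
and it returns an error where the kernel code would BUG_ON().

```c
#include <stddef.h>
#include <stdlib.h>

#define HBUF_TOTAL (8 * 4096)	/* same value as CR_HBUF_TOTAL in the patch */

/* Hypothetical userspace stand-in for the hbuf/hpos pair in struct cr_ctx. */
struct hbuf_ctx {
	unsigned char *hbuf;	/* backing storage */
	int hpos;		/* bytes currently reserved */
};

/* Reserve n bytes; returns NULL where the kernel code would BUG_ON(). */
static void *hbuf_get(struct hbuf_ctx *ctx, int n)
{
	void *ptr;

	if (n < 0 || n > HBUF_TOTAL - ctx->hpos)
		return NULL;
	ptr = ctx->hbuf + ctx->hpos;
	ctx->hpos += n;
	return ptr;
}

/* Release the n most recently reserved bytes; returns -1 on imbalance. */
static int hbuf_put(struct hbuf_ctx *ctx, int n)
{
	if (n < 0 || ctx->hpos < n)
		return -1;
	ctx->hpos -= n;
	return 0;
}

/* Nested use: the inner reservation is released before the outer one. */
static int hbuf_nested_demo(void)
{
	struct hbuf_ctx ctx = { .hbuf = malloc(HBUF_TOTAL), .hpos = 0 };
	unsigned char *outer, *inner;
	int ok;

	if (!ctx.hbuf)
		return -1;
	outer = hbuf_get(&ctx, 64);
	inner = hbuf_get(&ctx, 32);
	ok = outer && inner && inner == outer + 64 && ctx.hpos == 96;
	hbuf_put(&ctx, 32);
	hbuf_put(&ctx, 64);
	ok = ok && ctx.hpos == 0;
	free(ctx.hbuf);
	return ok ? 0 : -1;
}
```

Calling hbuf_nested_demo() from a small main() should return 0. The
point of the sketch is that every get is balanced by a put of the same
size, in reverse order, so the position returns to zero.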
For now, we can only checkpoint the 'current' task ("self" checkpoint),
and the 'pid' argument to the syscall is ignored.
Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.
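To make the file format concrete: each record in the image is a
struct cr_hdr (type, len, parent) followed by 'len' bytes of payload,
which is what cr_write_obj() and cr_read_obj() handle. Below is a
minimal userspace sketch of that framing over an in-memory buffer. It
is an illustration under assumptions, not the kernel code: struct
image, img_write, img_read, write_obj, read_obj and
record_roundtrip_demo are hypothetical names, and the value 3 for
CR_HDR_STRING is taken from the enum in checkpoint_hdr.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors struct cr_hdr from checkpoint_hdr.h, using userspace types. */
struct rec_hdr {
	int16_t type;
	int16_t len;
	uint32_t parent;
};

/* Hypothetical in-memory "image" standing in for the checkpoint file. */
struct image {
	unsigned char data[256];
	size_t wpos;	/* write position */
	size_t rpos;	/* read position */
};

static int img_write(struct image *img, const void *buf, size_t n)
{
	if (n > sizeof(img->data) - img->wpos)
		return -1;
	memcpy(img->data + img->wpos, buf, n);
	img->wpos += n;
	return 0;
}

static int img_read(struct image *img, void *buf, size_t n)
{
	if (n > img->wpos - img->rpos)
		return -1;	/* unexpected end of image */
	memcpy(buf, img->data + img->rpos, n);
	img->rpos += n;
	return 0;
}

/* Like cr_write_obj(): header first, then h->len bytes of payload. */
static int write_obj(struct image *img, const struct rec_hdr *h,
		     const void *buf)
{
	if (h->len < 0 || img_write(img, h, sizeof(*h)) < 0)
		return -1;
	return img_write(img, buf, (size_t)h->len);
}

/* Like cr_read_obj(): reject payloads that do not fit in the buffer. */
static int read_obj(struct image *img, struct rec_hdr *h, void *buf, int len)
{
	if (img_read(img, h, sizeof(*h)) < 0)
		return -1;
	if (h->len < 0 || h->len > len)
		return -1;
	return img_read(img, buf, (size_t)h->len);
}

/* Write one string record and read it back; returns 0 on success. */
static int record_roundtrip_demo(struct rec_hdr *in, char *out, int outlen)
{
	struct image img = { .wpos = 0, .rpos = 0 };
	struct rec_hdr h = { .type = 3 /* CR_HDR_STRING */, .len = 5,
			     .parent = 0 };

	if (write_obj(&img, &h, "test") < 0)	/* 4 chars + NUL = 5 bytes */
		return -1;
	return read_obj(&img, in, out, outlen);
}
```

The length check in read_obj() is the important part of the sketch:
the reader trusts the header only after confirming that the announced
payload fits in the buffer it was given.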
Changelog[v12]:
  - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
  - Split cr_write/cr_read() to two parts: _cr_write/read() helper
  - Befriend with sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' instead of using special cr_debug()
Changelog[v10]:
  - add cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()
  - force end-of-string in cr_read_string() (fix possible DoS)
Changelog[v9]:
  - cr_kwrite/cr_kread() use file->f_op->write() directly
  - Drop cr_uwrite/cr_uread() since they aren't used anywhere
Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (although it's not really needed)
Changelog[v5]:
  - Rename header files s/ckpt/checkpoint/
Changelog[v2]:
  - Added utsname->{release,version,machine} to checkpoint header
  - Pad header structures to 64 bits to ensure compatibility
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 Makefile                       |    2 +-
 checkpoint/Makefile            |    2 +-
 checkpoint/checkpoint.c        |  188 +++++++++++++++++++++++++++++++
 checkpoint/restart.c           |  239 ++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c               |  216 +++++++++++++++++++++++++++++++++++-
 include/linux/checkpoint.h     |   61 ++++++++++
 include/linux/checkpoint_hdr.h |   85 ++++++++++++++
 include/linux/magic.h          |    3 +
 8 files changed, 790 insertions(+), 6 deletions(-)
 create mode 100644 checkpoint/checkpoint.c
 create mode 100644 checkpoint/restart.c
 create mode 100644 include/linux/checkpoint.h
 create mode 100644 include/linux/checkpoint_hdr.h
diff --git a/Makefile b/Makefile
index 71e98e9..c1744a1 100644
--- a/Makefile
+++ b/Makefile
@@ -619,7 +619,7 @@ export mod_strip_cmd
 
 
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ checkpoint/
 
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 07d018b..d2df68c 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,4 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
new file mode 100644
index 0000000..35bf99b
--- /dev/null
+++ b/checkpoint/checkpoint.c
@@ -0,0 +1,188 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <linux/utsname.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* unique checkpoint identifier (FIXME: should be per-container ?) */
+static atomic_t cr_ctx_count = ATOMIC_INIT(0);
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+	int ret;
+
+	ret = cr_kwrite(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+	return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_buffer - write a buffer
+ * @ctx: checkpoint context
+ * @buf: buffer pointer
+ * @len: buffer size
+ */
+int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_BUFFER;
+	h.len = len;
+	h.parent = 0;
+
+	return cr_write_obj(ctx, &h, buf);
+}
+
+/**
+ * cr_write_string - write a string
+ * @ctx: checkpoint context
+ * @str: string pointer
+ * @len: string length
+ */
+int cr_write_string(struct cr_ctx *ctx, char *str, int len)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_STRING;
+	h.len = len;
+	h.parent = 0;
+
+	return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct new_utsname *uts;
+	struct timeval ktv;
+	int ret;
+
+	h.type = CR_HDR_HEAD;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	do_gettimeofday(&ktv);
+
+	hh->magic = CHECKPOINT_MAGIC_HEAD;
+	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	hh->rev = CR_VERSION;
+
+	hh->flags = ctx->flags;
+	hh->time = ktv.tv_sec;
+
+	uts = utsname();
+	memcpy(hh->release, uts->release, __NEW_UTS_LEN);
+	memcpy(hh->version, uts->version, __NEW_UTS_LEN);
+	memcpy(hh->machine, uts->machine, __NEW_UTS_LEN);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* write the checkpoint trailer */
+static int cr_write_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TAIL;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->magic = CHECKPOINT_MAGIC_TAIL;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* dump the task_struct of a given task */
+static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_TASK;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->state = t->state;
+	hh->exit_state = t->exit_state;
+	hh->exit_code = t->exit_code;
+	hh->exit_signal = t->exit_signal;
+
+	hh->task_comm_len = TASK_COMM_LEN;
+
+	/* FIXME: save remaining relevant task_struct fields */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
+}
+
+/* dump the entire state of a given task */
+static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
+{
+	int ret;
+
+	ret = cr_write_task_struct(ctx, t);
+	pr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = cr_write_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_task(ctx, current);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	ctx->crid = atomic_inc_return(&cr_ctx_count);
+
+	/* on success, return (unique) checkpoint identifier */
+	ret = ctx->crid;
+ out:
+	return ret;
+}
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
new file mode 100644
index 0000000..4741f4a
--- /dev/null
+++ b/checkpoint/restart.c
@@ -0,0 +1,239 @@
+/*
+ *  Restart logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/magic.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/**
+ * cr_read_obj - read a whole record (cr_hdr followed by payload)
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ * @len: available buffer size
+ */
+int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int len)
+{
+	int ret;
+
+	ret = cr_kread(ctx, h, sizeof(*h));
+	if (ret < 0)
+		return ret;
+
+	pr_debug("type %d len %d parent %d\n", h->type, h->len, h->parent);
+
+	if (h->len < 0 || h->len > len)
+		return -EINVAL;
+
+	return cr_kread(ctx, buf, h->len);
+}
+
+/**
+ * cr_read_obj_type - read a whole record of expected type and size
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @len: expected record size
+ * @type: expected record type
+ *
+ * Returns objref of the parent object
+ */
+int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, len);
+	if (ret < 0)
+		return ret;
+
+	if (h.len != len || h.type != type)
+		return -EINVAL;
+
+	return h.parent;
+}
+
+/**
+ * cr_read_buf_type - read a whole record of expected type (unknown size)
+ * @ctx: checkpoint context
+ * @buf: record buffer
+ * @len: available buffer size (output: actual record size)
+ * @type: expected record type
+ *
+ * Returns objref of the parent object
+ */
+int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type)
+{
+	struct cr_hdr h;
+	int ret;
+
+	ret = cr_read_obj(ctx, &h, buf, *len);
+	if (ret < 0)
+		return ret;
+
+	if (h.type != type)
+		return -EINVAL;
+
+	*len = h.len;
+	return h.parent;
+}
+
+/**
+ * cr_read_buffer - read a buffer
+ * @ctx: checkpoint context
+ * @buf: buffer
+ * @len: buffer size (output actual record size)
+ */
+int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len)
+{
+	return cr_read_buf_type(ctx, buf, len, CR_HDR_BUFFER);
+}
+
+/**
+ * cr_read_string - read a string
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @len: string length
+ */
+int cr_read_string(struct cr_ctx *ctx, char *str, int len)
+{
+	int ret;
+
+	ret = cr_read_buf_type(ctx, str, &len, CR_HDR_STRING);
+	if (ret < 0)
+		return ret;
+
+	if (len > 0)
+		str[len - 1] = '\0';	/* always play it safe */
+
+	return ret;
+}
+
+/* read the checkpoint header */
+static int cr_read_head(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->magic != CHECKPOINT_MAGIC_HEAD || hh->rev != CR_VERSION ||
+	    hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) ||
+	    hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) ||
+	    hh->patch != ((LINUX_VERSION_CODE) & 0xff))
+		goto out;
+
+	if (hh->flags & ~CR_CTX_CKPT)
+		goto out;
+
+	ctx->oflags = hh->flags;
+
+	/* FIX: verify compatibility of release, version and machine */
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the checkpoint trailer */
+static int cr_read_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->magic != CHECKPOINT_MAGIC_TAIL)
+		goto out;
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the task_struct into the current task */
+static int cr_read_task_struct(struct cr_ctx *ctx)
+{
+	struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	char *buf;
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	/* upper limit for task_comm_len to prevent DoS */
+	if (hh->task_comm_len < 0 || hh->task_comm_len > PAGE_SIZE)
+		goto out;
+
+	buf = kmalloc(hh->task_comm_len, GFP_KERNEL);
+	if (!buf)
+		goto out;
+	ret = cr_read_string(ctx, buf, hh->task_comm_len);
+	if (!ret) {
+		/* if t->comm is too long, silently truncate */
+		memset(t->comm, 0, TASK_COMM_LEN);
+		memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
+	}
+	kfree(buf);
+
+	/* FIXME: restore remaining relevant task_struct fields */
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+/* read the entire state of the current task */
+static int cr_read_task(struct cr_ctx *ctx)
+{
+	int ret;
+
+	ret = cr_read_task_struct(ctx);
+	pr_debug("ret %d\n", ret);
+
+	return ret;
+}
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tail(ctx);
+	if (ret < 0)
+		goto out;
+
+	/* on success, adjust the return value if needed [TODO] */
+ out:
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 375129c..76e2553 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -10,6 +10,180 @@
 
 #include <linux/sched.h>
 #include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/checkpoint.h>
+
+/*
+ * Helpers to write(read) from(to) kernel space to(from) the checkpoint
+ * image file descriptor (similar to how a core-dump is performed).
+ *
+ *   cr_kwrite() - write a kernel-space buffer to the checkpoint image
+ *   cr_kread() - read from the checkpoint image to a kernel-space buffer
+ */
+
+static inline int _cr_kwrite(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nwrite;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nwrite) {
+		loff_t pos = file_pos_read(file);
+		nwrite = vfs_write(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nwrite < 0) {
+			if (nwrite == -EAGAIN)
+				nwrite = 0;
+			else
+				return nwrite;
+		}
+		uaddr += nwrite;
+	}
+	return 0;
+}
+
+int cr_kwrite(struct cr_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _cr_kwrite(ctx->file, addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+static inline int _cr_kread(struct file *file, void *addr, int count)
+{
+	void __user *uaddr = (__force void __user *) addr;
+	ssize_t nread;
+	int nleft;
+
+	for (nleft = count; nleft; nleft -= nread) {
+		loff_t pos = file_pos_read(file);
+		nread = vfs_read(file, uaddr, nleft, &pos);
+		file_pos_write(file, pos);
+		if (nread <= 0) {
+			if (nread == -EAGAIN) {
+				nread = 0;
+				continue;
+			} else if (nread == 0)
+				nread = -EPIPE;		/* unexpected EOF */
+			return nread;
+		}
+		uaddr += nread;
+	}
+	return 0;
+}
+
+int cr_kread(struct cr_ctx *ctx, void *addr, int count)
+{
+	mm_segment_t fs;
+	int ret;
+
+	fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = _cr_kread(ctx->file , addr, count);
+	set_fs(fs);
+
+	ctx->total += count;
+	return ret;
+}
+
+/*
+ * During checkpoint and restart the code writes out/reads in data
+ * to/from the checkpoint image from/to a temporary buffer (ctx->hbuf).
+ * Because operations can be nested, use cr_hbuf_get() to reserve space
+ * in the buffer, then cr_hbuf_put() when you no longer need that space.
+ */
+
+/*
+ * ctx->hbuf is used to hold headers and data of known (or bound),
+ * static sizes. In some cases, multiple headers may be allocated in
+ * a nested manner. The size should accommodate all headers, nested
+ * or not, on all archs.
+ */
+#define CR_HBUF_TOTAL  (8 * 4096)
+
+/**
+ * cr_hbuf_get - reserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to reserve
+ *
+ * Returns pointer to reserved space
+ */
+void *cr_hbuf_get(struct cr_ctx *ctx, int n)
+{
+	void *ptr;
+
+	/*
+	 * Since requests depend on logic and static header sizes (not on
+	 * user data), space should always suffice, unless someone either
+	 * made a structure bigger or call path deeper than expected.
+	 */
+	BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
+	ptr = ctx->hbuf + ctx->hpos;
+	ctx->hpos += n;
+	return ptr;
+}
+
+/**
+ * cr_hbuf_put - unreserve space on the hbuf
+ * @ctx: checkpoint context
+ * @n: number of bytes to unreserve
+ */
+void cr_hbuf_put(struct cr_ctx *ctx, int n)
+{
+	BUG_ON(ctx->hpos < n);
+	ctx->hpos -= n;
+}
+
+/*
+ * helpers to manage C/R contexts: allocated for each checkpoint and/or
+ * restart operation, and persists until the operation is completed.
+ */
+
+static void cr_ctx_free(struct cr_ctx *ctx)
+{
+	if (ctx->file)
+		fput(ctx->file);
+	kfree(ctx->hbuf);
+	kfree(ctx);
+}
+
+static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
+{
+	struct cr_ctx *ctx;
+	int err;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return ERR_PTR(-ENOMEM);
+
+	ctx->flags = flags;
+
+	err = -EBADF;
+	ctx->file = fget(fd);
+	if (!ctx->file)
+		goto err;
+
+	err = -ENOMEM;
+	ctx->hbuf = kmalloc(CR_HBUF_TOTAL, GFP_KERNEL);
+	if (!ctx->hbuf)
+		goto err;
+
+	return ctx;
+
+ err:
+	cr_ctx_free(ctx);
+	return ERR_PTR(err);
+}
 
 /**
  * sys_checkpoint - checkpoint a container
@@ -22,9 +196,26 @@
  */
 asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 {
-	pr_debug("sys_checkpoint not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(fd, flags | CR_CTX_CKPT);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	ret = do_checkpoint(ctx, pid);
+
+	if (!ret)
+		ret = ctx->crid;
+
+	cr_ctx_free(ctx);
+	return ret;
 }
+
 /**
  * sys_restart - restart a container
  * @crid: checkpoint image identifier
@@ -36,6 +227,23 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	pr_debug("sys_restart not implemented yet\n");
-	return -ENOSYS;
+	struct cr_ctx *ctx;
+	pid_t pid;
+	int ret;
+
+	/* no flags for now */
+	if (flags)
+		return -EINVAL;
+
+	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
+	/* FIXME: for now, we use 'crid' as a pid */
+	pid = (pid_t) crid;
+
+	ret = do_restart(ctx, pid);
+
+	cr_ctx_free(ctx);
+	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
new file mode 100644
index 0000000..65a2cbf
--- /dev/null
+++ b/include/linux/checkpoint.h
@@ -0,0 +1,61 @@
+#ifndef _CHECKPOINT_CKPT_H_
+#define _CHECKPOINT_CKPT_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#define CR_VERSION  1
+
+struct cr_ctx {
+	int crid;		/* unique checkpoint id */
+
+	pid_t root_pid;		/* container identifier */
+
+	unsigned long flags;
+	unsigned long oflags;	/* restart: old flags */
+
+	struct file *file;
+	int total;		/* total read/written */
+
+	void *hbuf;		/* temporary buffer for headers */
+	int hpos;		/* position in headers buffer */
+};
+
+/* cr_ctx: flags */
+#define CR_CTX_CKPT	0x1
+#define CR_CTX_RSTR	0x2
+
+extern int cr_kwrite(struct cr_ctx *ctx, void *buf, int count);
+extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
+
+extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
+extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
+
+struct cr_hdr;
+
+extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
+extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
+extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
+extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
+extern int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type);
+extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
+extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
+
+extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
+extern int do_restart(struct cr_ctx *ctx, pid_t pid);
+
+#ifdef pr_fmt
+#undef pr_fmt
+#endif
+
+#define pr_fmt(fmt) "[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__
+
+#endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
new file mode 100644
index 0000000..fcc0125
--- /dev/null
+++ b/include/linux/checkpoint_hdr.h
@@ -0,0 +1,85 @@
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#define _CHECKPOINT_CKPT_HDR_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+#include <linux/utsname.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__ ((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ *   "This structure has an odd multiple of 32-bit members, which means
+ *   that if you put it into a larger structure that also contains 64-bit
+ *   members, the larger structure may get different alignment on x86-32
+ *   and x86-64, which you might want to avoid. I can't tell if this is
+ *   an actual problem here. ... In this case, I'm pretty sure that
+ *   sizeof(cr_hdr_task) on x86-32 is different from x86-64, since it
+ *   will be 32-bit aligned on x86-32."
+ */
+
+/* records: generic header */
+
+struct cr_hdr {
+	__s16 type;
+	__s16 len;
+	__u32 parent;
+};
+
+/* header types */
+enum {
+	CR_HDR_HEAD = 1,
+	CR_HDR_BUFFER,
+	CR_HDR_STRING,
+
+	CR_HDR_TASK = 101,
+	CR_HDR_THREAD,
+	CR_HDR_CPU,
+
+	CR_HDR_MM = 201,
+	CR_HDR_VMA,
+	CR_HDR_MM_CONTEXT,
+
+	CR_HDR_TAIL = 5001
+};
+
+struct cr_hdr_head {
+	__u64 magic;
+
+	__u16 major;
+	__u16 minor;
+	__u16 patch;
+	__u16 rev;
+
+	__u64 time;	/* when checkpoint taken */
+	__u64 flags;	/* checkpoint options */
+
+	char release[__NEW_UTS_LEN];
+	char version[__NEW_UTS_LEN];
+	char machine[__NEW_UTS_LEN];
+} __attribute__((aligned(8)));
+
+struct cr_hdr_tail {
+	__u64 magic;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_task {
+	__u32 state;
+	__u32 exit_state;
+	__u32 exit_code;
+	__u32 exit_signal;
+
+	__s32 task_comm_len;
+} __attribute__((aligned(8)));
+
+#endif /* _CHECKPOINT_CKPT_HDR_H_ */
diff --git a/include/linux/magic.h b/include/linux/magic.h
index f7f3fdd..5939bbe 100644
--- a/include/linux/magic.h
+++ b/include/linux/magic.h
@@ -46,4 +46,7 @@
 #define FUTEXFS_SUPER_MAGIC	0xBAD1DEA
 #define INOTIFYFS_SUPER_MAGIC	0x2BAD1DEA
 
+#define CHECKPOINT_MAGIC_HEAD  0x00feed0cc0a2d200LL
+#define CHECKPOINT_MAGIC_TAIL  0x002d2a0cc0deef00LL
+
 #endif /* __LINUX_MAGIC_H__ */
-- 
1.5.4.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [RFC v13][PATCH 05/14] x86 support for checkpoint/restart
  2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
                   ` (3 preceding siblings ...)
  2009-01-27 17:08 ` [RFC v13][PATCH 04/14] General infrastructure for checkpoint restart Oren Laadan
@ 2009-01-27 17:08 ` Oren Laadan
  2009-02-24  7:47   ` Nathan Lynch
  2009-01-27 17:08 ` [RFC v13][PATCH 06/14] Dump memory address space Oren Laadan
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
Add logic to save and restore architecture specific state, including
thread-specific state, CPU registers and FPU state.
In addition, architecture capabilities are saved in an
architecture-specific extension of the header (cr_hdr_head_arch);
currently this includes only FPU capabilities.
Only x86-32 is supported for now. Compiling on x86-64 triggers
an explicit error.
Changelog[v12]:
  - A couple of missed calls to cr_hbuf_put()
  - Replace obsolete cr_debug() with pr_debug()
Changelog[v9]:
  - Add arch-specific header that details architecture capabilities;
    split FPU restore to send capabilities only once.
  - Test for zero TLS entries in cr_write_thread()
  - Fix asm/checkpoint_hdr.h so it can be included from user-space
Changelog[v7]:
  - Fix save/restore state of FPU
Changelog[v5]:
  - Remove preempt_disable() when restoring debug registers
Changelog[v4]:
  - Fix header structure alignment
Changelog[v2]:
  - Pad header structures to 64 bits to ensure compatibility
  - Follow Dave Hansen's refactoring of the original post
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/checkpoint_hdr.h |  100 ++++++++++++++
 arch/x86/mm/Makefile                  |    2 +
 arch/x86/mm/checkpoint.c              |  236 +++++++++++++++++++++++++++++++++
 arch/x86/mm/restart.c                 |  234 ++++++++++++++++++++++++++++++++
 checkpoint/checkpoint.c               |   19 +++-
 checkpoint/checkpoint_arch.h          |    9 ++
 checkpoint/restart.c                  |   17 ++-
 include/linux/checkpoint_hdr.h        |    2 +
 8 files changed, 613 insertions(+), 6 deletions(-)
 create mode 100644 arch/x86/include/asm/checkpoint_hdr.h
 create mode 100644 arch/x86/mm/checkpoint.c
 create mode 100644 arch/x86/mm/restart.c
 create mode 100644 checkpoint/checkpoint_arch.h
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..f966e70
--- /dev/null
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -0,0 +1,100 @@
+#ifndef __ASM_X86_CKPT_HDR_H
+#define __ASM_X86_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/types.h>
+
+/*
+ * To maintain compatibility between 32-bit and 64-bit architecture flavors,
+ * keep data 64-bit aligned: use padding for structure members, and use
+ * __attribute__ ((aligned (8))) for the entire structure.
+ *
+ * Quoting Arnd Bergmann:
+ *   "This structure has an odd multiple of 32-bit members, which means
+ *   that if you put it into a larger structure that also contains 64-bit
+ *   members, the larger structure may get different alignment on x86-32
+ *   and x86-64, which you might want to avoid. I can't tell if this is
+ *   an actual problem here. ... In this case, I'm pretty sure that
+ *   sizeof(cr_hdr_task) on x86-32 is different from x86-64, since it
+ *   will be 32-bit aligned on x86-32."
+ */
+
+/* i387 structure seen from kernel/userspace */
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#else
+#include <sys/user.h>
+#endif
+
+struct cr_hdr_head_arch {
+	/* FIXME: add HAVE_HWFP */
+
+	__u16 has_fxsr;
+	__u16 has_xsave;
+	__u16 xstate_size;
+	__u16 _padding;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_thread {
+	/* FIXME: restart blocks */
+
+	__s16 gdt_entry_tls_entries;
+	__s16 sizeof_tls_array;
+	__s16 ntls;	/* number of TLS entries to follow */
+} __attribute__((aligned(8)));
+
+struct cr_hdr_cpu {
+	/* see struct pt_regs (x86-64) */
+	__u64 r15;
+	__u64 r14;
+	__u64 r13;
+	__u64 r12;
+	__u64 bp;
+	__u64 bx;
+	__u64 r11;
+	__u64 r10;
+	__u64 r9;
+	__u64 r8;
+	__u64 ax;
+	__u64 cx;
+	__u64 dx;
+	__u64 si;
+	__u64 di;
+	__u64 orig_ax;
+	__u64 ip;
+	__u64 cs;
+	__u64 flags;
+	__u64 sp;
+	__u64 ss;
+
+	/* segment registers */
+	__u64 ds;
+	__u64 es;
+	__u64 fs;
+	__u64 gs;
+
+	/* debug registers */
+	__u64 debugreg0;
+	__u64 debugreg1;
+	__u64 debugreg2;
+	__u64 debugreg3;
+	__u64 debugreg4;
+	__u64 debugreg5;
+	__u64 debugreg6;
+	__u64 debugreg7;
+
+	__u32 uses_debug;
+	__u32 used_math;
+
+	/* thread_xstate contents follow (if used_math) */
+} __attribute__((aligned(8)));
+
+#endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index fea4565..6527ea2 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -18,3 +18,5 @@ obj-$(CONFIG_K8_NUMA)		+= k8topology_64.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat_$(BITS).o
 
 obj-$(CONFIG_MEMTEST)		+= memtest.o
+
+obj-$(CONFIG_CHECKPOINT_RESTART) += checkpoint.o restart.o
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
new file mode 100644
index 0000000..243a15c
--- /dev/null
+++ b/arch/x86/mm/checkpoint.c
@@ -0,0 +1,236 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct thread_struct *thread;
+	struct desc_struct *desc;
+	int ntls = 0;
+	int n, ret;
+
+	h.type = CR_HDR_THREAD;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	thread = &t->thread;
+
+	/* calculate no. of TLS entries that follow */
+	desc = thread->tls_array;
+	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
+		if (desc->a || desc->b)
+			ntls++;
+	}
+
+	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	hh->sizeof_tls_array = sizeof(thread->tls_array);
+	hh->ntls = ntls;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	pr_debug("ntls %d\n", ntls);
+	if (ntls == 0)
+		return 0;
+
+	/*
+	 * For simplicity dump the entire array, cherry-pick upon restart
+	 * FIXME: the TLS descriptors in the GDT should be called out and
+	 * not tied to the in-kernel representation.
+	 */
+	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+
+	/* IGNORE RESTART BLOCKS FOR NOW ... */
+
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+static void cr_save_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	hh->bp = regs->bp;
+	hh->bx = regs->bx;
+	hh->ax = regs->ax;
+	hh->cx = regs->cx;
+	hh->dx = regs->dx;
+	hh->si = regs->si;
+	hh->di = regs->di;
+	hh->orig_ax = regs->orig_ax;
+	hh->ip = regs->ip;
+	hh->cs = regs->cs;
+	hh->flags = regs->flags;
+	hh->sp = regs->sp;
+	hh->ss = regs->ss;
+
+	hh->ds = regs->ds;
+	hh->es = regs->es;
+
+	/*
+	 * for checkpoint in process context (from within a container)
+	 * the GS and FS registers should be saved from the hardware;
+	 * otherwise they are already saved on the thread structure
+	 */
+	if (t == current) {
+		savesegment(gs, hh->gs);
+		savesegment(fs, hh->fs);
+	} else {
+		hh->gs = thread->gs;
+		hh->fs = thread->fs;
+	}
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into the saved ax, so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (t == current) {
+		BUG_ON((long) regs->orig_ax < 0);	/* hh->orig_ax is unsigned */
+		hh->ax = 0;
+	}
+}
+
+static void cr_save_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+
+	/* debug regs */
+
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * get the actual registers; otherwise get the saved values.
+	 */
+
+	if (t == current) {
+		get_debugreg(hh->debugreg0, 0);
+		get_debugreg(hh->debugreg1, 1);
+		get_debugreg(hh->debugreg2, 2);
+		get_debugreg(hh->debugreg3, 3);
+		get_debugreg(hh->debugreg6, 6);
+		get_debugreg(hh->debugreg7, 7);
+	} else {
+		hh->debugreg0 = thread->debugreg0;
+		hh->debugreg1 = thread->debugreg1;
+		hh->debugreg2 = thread->debugreg2;
+		hh->debugreg3 = thread->debugreg3;
+		hh->debugreg6 = thread->debugreg6;
+		hh->debugreg7 = thread->debugreg7;
+	}
+
+	hh->debugreg4 = 0;
+	hh->debugreg5 = 0;
+
+	hh->uses_debug = !!(task_thread_info(t)->flags & TIF_DEBUG);
+}
+
+static void cr_save_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	hh->used_math = tsk_used_math(t) ? 1 : 0;
+}
+
+static int cr_write_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
+	int ret;
+
+	/* i387 + MMX + SSE logic */
+	preempt_disable();	/* needed if (t == current) */
+
+	/*
+	 * normally, no need to unlazy_fpu(), since the TS_USEDFPU flag
+	 * has been cleared when the task was context-switched out...
+	 * except if we are in process context, in which case we do
+	 */
+	if (t == current && (task_thread_info(t)->status & TS_USEDFPU))
+		unlazy_fpu(current);
+
+	/*
+	 * For simplicity dump the entire structure.
+	 * FIXME: need to be deliberate about what registers we are
+	 * dumping for traceability and compatibility.
+	 */
+	memcpy(xstate_buf, t->thread.xstate, xstate_size);
+	preempt_enable();	/* needed if (t == current) */
+
+	ret = cr_kwrite(ctx, xstate_buf, xstate_size);
+	cr_hbuf_put(ctx, xstate_size);
+
+	return ret;
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_CPU;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	cr_save_cpu_regs(hh, t);
+	cr_save_cpu_debug(hh, t);
+	cr_save_cpu_fpu(hh, t);
+
+	pr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	if (ret < 0)
+		goto out;
+
+	if (hh->used_math)
+		ret = cr_write_cpu_fpu(ctx, t);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_write_head_arch(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_HEAD_ARCH;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	/* FPU capabilities */
+	hh->has_fxsr = cpu_has_fxsr;
+	hh->has_xsave = cpu_has_xsave;
+	hh->xstate_size = xstate_size;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
new file mode 100644
index 0000000..f5c3f16
--- /dev/null
+++ b/arch/x86/mm/restart.c
@@ -0,0 +1,234 @@
+/*
+ *  Checkpoint/restart - architecture specific support for x86
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+/* read the thread_struct into the current task */
+int cr_read_thread(struct cr_ctx *ctx)
+{
+	struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	struct thread_struct *thread = &t->thread;
+	int parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(t))
+		goto out;
+#endif
+	pr_debug("ntls %d\n", hh->ntls);
+
+	if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES ||
+	    hh->sizeof_tls_array != sizeof(thread->tls_array) ||
+	    hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES)
+		goto out;
+
+	ret = 0;	/* a failed cr_kread() below overrides this */
+	if (hh->ntls > 0) {
+		struct desc_struct *desc;
+		int size, cpu;
+
+		/*
+		 * restore TLS by hand: why convert to struct user_desc if
+		 * sys_set_thread_area() will convert it back ?
+		 */
+
+		size = sizeof(*desc) * GDT_ENTRY_TLS_ENTRIES;
+		desc = kmalloc(size, GFP_KERNEL);
+		if (!desc) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		ret = cr_kread(ctx, desc, size);
+		if (ret == 0) {
+			/*
+			 * FIX: add sanity checks (e.g. that values make
+			 * sense, that we don't overwrite old values, etc.)
+			 */
+			cpu = get_cpu();
+			memcpy(thread->tls_array, desc, size);
+			load_TLS(thread, cpu);
+			put_cpu();
+		}
+		kfree(desc);
+	}
+
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+#ifdef CONFIG_X86_64
+
+#error "CONFIG_X86_64 unsupported yet."
+
+#else	/* !CONFIG_X86_64 */
+
+static int cr_load_cpu_regs(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	struct thread_struct *thread = &t->thread;
+	struct pt_regs *regs = task_pt_regs(t);
+
+	regs->bx = hh->bx;
+	regs->cx = hh->cx;
+	regs->dx = hh->dx;
+	regs->si = hh->si;
+	regs->di = hh->di;
+	regs->bp = hh->bp;
+	regs->ax = hh->ax;
+	regs->ds = hh->ds;
+	regs->es = hh->es;
+	regs->orig_ax = hh->orig_ax;
+	regs->ip = hh->ip;
+	regs->cs = hh->cs;
+	regs->flags = hh->flags;
+	regs->sp = hh->sp;
+	regs->ss = hh->ss;
+
+	thread->gs = hh->gs;
+	thread->fs = hh->fs;
+	loadsegment(gs, hh->gs);
+	loadsegment(fs, hh->fs);
+
+	return 0;
+}
+
+static int cr_load_cpu_debug(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	/* debug regs */
+
+	if (hh->uses_debug) {
+		set_debugreg(hh->debugreg0, 0);
+		set_debugreg(hh->debugreg1, 1);
+		/* ignore 4, 5 */
+		set_debugreg(hh->debugreg2, 2);
+		set_debugreg(hh->debugreg3, 3);
+		set_debugreg(hh->debugreg6, 6);
+		set_debugreg(hh->debugreg7, 7);
+	}
+
+	return 0;
+}
+
+static int cr_load_cpu_fpu(struct cr_hdr_cpu *hh, struct task_struct *t)
+{
+	preempt_disable();
+
+	__clear_fpu(t);		/* in case we used FPU in user mode */
+
+	if (!hh->used_math)
+		clear_used_math();
+
+	preempt_enable();
+	return 0;
+}
+
+static int cr_read_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
+	int ret;
+
+	ret = cr_kread(ctx, xstate_buf, xstate_size);
+	if (ret < 0)
+		goto out;
+
+	/* i387 + MMX + SSE */
+	preempt_disable();
+
+	/* init_fpu() also calls set_used_math() */
+	ret = init_fpu(current);
+	/* on failure, still re-enable preemption and release the buffer */
+	if (ret >= 0)
+		memcpy(t->thread.xstate, xstate_buf, xstate_size);
+
+	preempt_enable();
+ out:
+	cr_hbuf_put(ctx, xstate_size);
+	return ret;
+}
+
+#endif	/* CONFIG_X86_64 */
+
+/* read the cpu state and registers for the current task */
+int cr_read_cpu(struct cr_ctx *ctx)
+{
+	struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct task_struct *t = current;
+	int parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(t))
+		goto out;
+#endif
+	/* FIX: sanity check for sensitive registers (eg. eflags) */
+
+	pr_debug("math %d debug %d\n", hh->used_math, hh->uses_debug);
+
+	ret = cr_load_cpu_regs(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_load_cpu_debug(hh, t);
+	if (ret < 0)
+		goto out;
+	ret = cr_load_cpu_fpu(hh, t);
+	if (ret < 0)
+		goto out;
+
+	if (hh->used_math)
+		ret = cr_read_cpu_fpu(ctx, t);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_read_head_arch(struct cr_ctx *ctx)
+{
+	struct cr_hdr_head_arch *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, ret = 0;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD_ARCH);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	/* FIX: verify compatibility of architecture features */
+
+	/* verify FPU capabilities */
+	if (hh->has_fxsr != cpu_has_fxsr ||
+	    hh->has_xsave != cpu_has_xsave ||
+	    hh->xstate_size != xstate_size)
+		ret = -EINVAL;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 35bf99b..9c5430d 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -20,6 +20,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /* unique checkpoint identifier (FIXME: should be per-container ?) */
 static atomic_t cr_ctx_count = ATOMIC_INIT(0);
 
@@ -105,7 +107,10 @@ static int cr_write_head(struct cr_ctx *ctx)
 
 	ret = cr_write_obj(ctx, &h, hh);
 	cr_hbuf_put(ctx, sizeof(*hh));
-	return ret;
+	if (ret < 0)
+		return ret;
+
+	return cr_write_head_arch(ctx);
 }
 
 /* write the checkpoint trailer */
@@ -160,8 +165,16 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	int ret;
 
 	ret = cr_write_task_struct(ctx, t);
-	pr_debug("ret %d\n", ret);
-
+	pr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_thread(ctx, t);
+	pr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_write_cpu(ctx, t);
+	pr_debug("cpu: ret %d\n", ret);
+ out:
 	return ret;
 }
 
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
new file mode 100644
index 0000000..ada1369
--- /dev/null
+++ b/checkpoint/checkpoint_arch.h
@@ -0,0 +1,9 @@
+#include <linux/checkpoint.h>
+
+extern int cr_write_head_arch(struct cr_ctx *ctx);
+extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+
+extern int cr_read_head_arch(struct cr_ctx *ctx);
+extern int cr_read_thread(struct cr_ctx *ctx);
+extern int cr_read_cpu(struct cr_ctx *ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 4741f4a..f40b619 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -15,6 +15,8 @@
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+#include "checkpoint_arch.h"
+
 /**
  * cr_read_obj - read a whole record (cr_hdr followed by payload)
  * @ctx: checkpoint context
@@ -142,9 +144,9 @@ static int cr_read_head(struct cr_ctx *ctx)
 
 	ctx->oflags = hh->flags;
 
-	/* FIX: verify compatibility of release, version and machine */
+	/* FIX: verify compatibility of release, version */
 
-	ret = 0;
+	ret = cr_read_head_arch(ctx);
  out:
 	cr_hbuf_put(ctx, sizeof(*hh));
 	return ret;
@@ -214,8 +216,17 @@ static int cr_read_task(struct cr_ctx *ctx)
 	int ret;
 
 	ret = cr_read_task_struct(ctx);
-	pr_debug("ret %d\n", ret);
+	pr_debug("task_struct: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_thread(ctx);
+	pr_debug("thread: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_cpu(ctx);
+	pr_debug("cpu: ret %d\n", ret);
 
+ out:
 	return ret;
 }
 
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index fcc0125..3efd009 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -12,6 +12,7 @@
 
 #include <linux/types.h>
 #include <linux/utsname.h>
+#include <asm/checkpoint_hdr.h>
 
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
@@ -39,6 +40,7 @@ struct cr_hdr {
 /* header types */
 enum {
 	CR_HDR_HEAD = 1,
+	CR_HDR_HEAD_ARCH,
 	CR_HDR_BUFFER,
 	CR_HDR_STRING,
 
-- 
1.5.4.3
* [RFC v13][PATCH 06/14] Dump memory address space
  2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
                   ` (4 preceding siblings ...)
  2009-01-27 17:08 ` [RFC v13][PATCH 05/14] x86 support for checkpoint/restart Oren Laadan
@ 2009-01-27 17:08 ` Oren Laadan
  2009-01-27 17:08 ` [RFC v13][PATCH 07/14] Restore " Oren Laadan
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
For each VMA, there is a 'struct cr_vma'; if the VMA is file-mapped,
it is followed by the file name. Then come the actual contents, in
one or more chunks: each chunk begins with a header that specifies
how many pages it holds, then the virtual addresses of all the dumped
pages in that chunk, followed by the actual contents of all dumped
pages. A header with a page count of zero marks the end of the contents.
Then comes the next VMA, and so on.
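To make the chunk layout concrete, here is a minimal user-space sketch (not
the patch's code) that walks the page chunks of one VMA in a memory image.
The exact on-disk chunk header is defined by the patch itself, so the layout
assumed here (a 64-bit page count, then 64-bit addresses, then 4096-byte
pages) is only an assumption, as are the names demo_walk_chunks and
DEMO_PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u	/* assumed page size, illustration only */

/*
 * Walk the page chunks of a single VMA stored in buf[0..len).
 * Assumed layout of each chunk:
 *   u64 nr_pages | nr_pages * u64 vaddrs | nr_pages * DEMO_PAGE_SIZE bytes
 * A chunk whose nr_pages is zero terminates the VMA's contents.
 *
 * Returns the total number of pages seen, or -1 if the image is truncated.
 * On success, *consumed holds the bytes walked, terminator included.
 */
static long demo_walk_chunks(const unsigned char *buf, size_t len,
			     size_t *consumed)
{
	size_t pos = 0;
	long total = 0;

	assert(buf != NULL && consumed != NULL);

	for (;;) {
		uint64_t nr;

		/* chunk header: number of pages in this chunk */
		if (len - pos < sizeof(nr))
			return -1;
		memcpy(&nr, buf + pos, sizeof(nr));
		pos += sizeof(nr);

		if (nr == 0)
			break;		/* end-of-contents marker */

		/* overflow-safe check that addresses and pages fit */
		if (nr > (len - pos) / (sizeof(uint64_t) + DEMO_PAGE_SIZE))
			return -1;

		pos += (size_t)nr * sizeof(uint64_t);	/* all vaddrs first */
		pos += (size_t)nr * DEMO_PAGE_SIZE;	/* then all contents */
		total += (long)nr;
	}

	*consumed = pos;
	return total;
}
```

Keeping all addresses of a chunk ahead of its page contents lets a reader
fetch the address array in one read and then stream the pages, which matches
the stated reason for using separate arrays in cr_pgarr.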
Changelog[v13]:
  - pgprot_t is an abstract type; use the proper accessor (fix for
    64-bit powerpc by Nathan Lynch <ntl@pobox.com>)
Changelog[v12]:
  - Hide pgarr management inside cr_private_vma_fill_pgarr()
  - Fix management of pgarr chain reset and alloc/expand: keep empty
    pgarr in a pool chain
  - Replace obsolete cr_debug() with pr_debug()
Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them.
  - Add missing test for VM_MAYSHARE when dumping memory
Changelog[v10]:
  - Acquire dcache_lock around call to __d_path() in cr_fill_fname()
Changelog[v9]:
  - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup
  - Test if __d_path() changes mnt/dentry (when crossing filesystem
    namespace boundary). For now cr_fill_fname() fails the checkpoint.
Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
Changelog[v5]:
  - Improve memory dump code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
Changelog[v4]:
  - Use standard list_... for cr_pgarr
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/checkpoint_hdr.h |    5 +
 arch/x86/mm/checkpoint.c              |   31 ++
 checkpoint/Makefile                   |    3 +-
 checkpoint/checkpoint.c               |   88 ++++++
 checkpoint/checkpoint_arch.h          |    2 +
 checkpoint/checkpoint_mem.h           |   41 +++
 checkpoint/ckpt_mem.c                 |  541 +++++++++++++++++++++++++++++++++
 checkpoint/sys.c                      |   11 +
 include/linux/checkpoint.h            |   13 +
 include/linux/checkpoint_hdr.h        |   32 ++
 10 files changed, 766 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/checkpoint_mem.h
 create mode 100644 checkpoint/ckpt_mem.c
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index f966e70..eb95705 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -97,4 +97,9 @@ struct cr_hdr_cpu {
 	/* thread_xstate contents follow (if used_math) */
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm_context {
+	__s16 ldt_entry_size;
+	__s16 nldt;
+} __attribute__((aligned(8)));
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 243a15c..50bde9a 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -234,3 +234,34 @@ int cr_write_head_arch(struct cr_ctx *ctx)
 
 	return ret;
 }
+
+/* dump the mm->context state */
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int ret;
+
+	h.type = CR_HDR_MM_CONTEXT;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	mutex_lock(&mm->context.lock);
+
+	hh->ldt_entry_size = LDT_ENTRY_SIZE;
+	hh->nldt = mm->context.size;
+
+	pr_debug("nldt %d\n", hh->nldt);
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	ret = cr_kwrite(ctx, mm->context.ldt,
+			mm->context.size * LDT_ENTRY_SIZE);
+
+ out:
+	mutex_unlock(&mm->context.lock);
+	return ret;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index d2df68c..3a0df6d 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,4 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+		ckpt_mem.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 9c5430d..5c47184 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -13,6 +13,7 @@
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
+#include <linux/fdtable.h>
 #include <linux/dcache.h>
 #include <linux/mount.h>
 #include <linux/utsname.h>
@@ -75,6 +76,66 @@ int cr_write_string(struct cr_ctx *ctx, char *str, int len)
 	return cr_write_obj(ctx, &h, str);
 }
 
+/**
+ * cr_fill_fname - return pathname of a given file
+ * @path: path name
+ * @root: relative root
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ */
+static char *
+cr_fill_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+	struct path tmp = *root;
+	char *fname;
+
+	BUG_ON(!buf);
+	spin_lock(&dcache_lock);
+	fname = __d_path(path, &tmp, buf, *n);
+	spin_unlock(&dcache_lock);
+	if (!IS_ERR(fname))
+		*n = (buf + (*n) - fname);
+	/*
+	 * FIXME: if __d_path() changed these, it must have stepped out of
+	 * init's namespace. Since currently we require a unified namespace
+	 * within the container: simply fail.
+	 */
+	if (tmp.mnt != root->mnt || tmp.dentry != root->dentry)
+		fname = ERR_PTR(-EBADF);
+
+	return fname;
+}
+
+/**
+ * cr_write_fname - write a file name
+ * @ctx: checkpoint context
+ * @path: path name
+ * @root: relative root
+ */
+int cr_write_fname(struct cr_ctx *ctx, struct path *path, struct path *root)
+{
+	struct cr_hdr h;
+	char *buf, *fname;
+	int ret, flen;
+
+	flen = PATH_MAX;
+	buf = kmalloc(flen, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	fname = cr_fill_fname(path, root, buf, &flen);
+	if (!IS_ERR(fname)) {
+		h.type = CR_HDR_FNAME;
+		h.len = flen;
+		h.parent = 0;
+		ret = cr_write_obj(ctx, &h, fname);
+	} else
+		ret = PTR_ERR(fname);
+
+	kfree(buf);
+	return ret;
+}
+
 /* write the checkpoint header */
 static int cr_write_head(struct cr_ctx *ctx)
 {
@@ -168,6 +229,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	pr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_mm(ctx, t);
+	pr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	pr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
@@ -178,10 +243,33 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
+{
+	struct fs_struct *fs;
+
+	ctx->root_pid = pid;
+
+	/*
+	 * assume checkpointer is in container's root vfs
+	 * FIXME: this works for now, but will change with real containers
+	 */
+
+	fs = current->fs;
+	read_lock(&fs->lock);
+	ctx->fs_mnt = fs->root;
+	path_get(&ctx->fs_mnt);
+	read_unlock(&fs->lock);
+
+	return 0;
+}
+
 int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_ctx_checkpoint(ctx, pid);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index ada1369..f06c7eb 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -3,6 +3,8 @@
 extern int cr_write_head_arch(struct cr_ctx *ctx);
 extern int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
 extern int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_mm_context(struct cr_ctx *ctx,
+			       struct mm_struct *mm, int parent);
 
 extern int cr_read_head_arch(struct cr_ctx *ctx);
 extern int cr_read_thread(struct cr_ctx *ctx);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
new file mode 100644
index 0000000..3e48bc4
--- /dev/null
+++ b/checkpoint/checkpoint_mem.h
@@ -0,0 +1,41 @@
+#ifndef _CHECKPOINT_CKPT_MEM_H_
+#define _CHECKPOINT_CKPT_MEM_H_
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/mm_types.h>
+
+/*
+ * page-array chains: each cr_pgarr describes a set of <struct page *,vaddr>
+ * tuples (where vaddr is the virtual address of a page in a particular mm).
+ * Specifically, we use separate arrays so that all vaddrs can be written
+ * and read at once.
+ */
+
+struct cr_pgarr {
+	unsigned long *vaddrs;
+	struct page **pages;
+	unsigned int nr_used;
+	struct list_head list;
+};
+
+#define CR_PGARR_TOTAL  (PAGE_SIZE / sizeof(void *))
+#define CR_PGARR_CHUNK  (4 * CR_PGARR_TOTAL)
+
+extern void cr_pgarr_free(struct cr_ctx *ctx);
+extern struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx);
+extern void cr_pgarr_reset_all(struct cr_ctx *ctx);
+
+static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
+{
+	return (pgarr->nr_used == CR_PGARR_TOTAL);
+}
+
+#endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
new file mode 100644
index 0000000..4925ff2
--- /dev/null
+++ b/checkpoint/ckpt_mem.c
@@ -0,0 +1,541 @@
+/*
+ *  Checkpoint memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * utilities to alloc, free, and handle 'struct cr_pgarr' (page-arrays)
+ * (common to ckpt_mem.c and rstr_mem.c).
+ *
+ * The checkpoint context structure has two members for page-arrays:
+ *   ctx->pgarr_list: list head of populated page-array chain
+ *   ctx->pgarr_pool: list head of empty page-array pool chain
+ *
+ * During checkpoint (and restart) the chain tracks the dirty pages (page
+ * pointer and virtual address) of each MM. For a particular MM, these are
+ * always added to the head of the page-array chain (ctx->pgarr_list).
+ * Before the next chunk of pages, the chain is reset (dropping the page
+ * references) but not freed; instead, empty descriptors are kept in a pool.
+ *
+ * The head of the chain page-array ("current") advances as necessary. When
+ * it gets full, a new page-array descriptor is pushed in front of it. The
+ * new descriptor is taken from first empty descriptor (if one exists, for
+ * instance, after a chain reset), or allocated on-demand.
+ *
+ * When dumping the data, the chain is traversed in reverse order.
+ */
+
+/* return first page-array in the chain */
+static inline struct cr_pgarr *cr_pgarr_first(struct cr_ctx *ctx)
+{
+	if (list_empty(&ctx->pgarr_list))
+		return NULL;
+	return list_first_entry(&ctx->pgarr_list, struct cr_pgarr, list);
+}
+
+/* return (and detach) first empty page-array in the pool, if exists */
+static inline struct cr_pgarr *cr_pgarr_from_pool(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	if (list_empty(&ctx->pgarr_pool))
+		return NULL;
+	pgarr = list_first_entry(&ctx->pgarr_pool, struct cr_pgarr, list);
+	list_del(&pgarr->list);
+	return pgarr;
+}
+
+/* release pages referenced by a page-array */
+static void cr_pgarr_release_pages(struct cr_pgarr *pgarr)
+{
+	pr_debug("nr_used %d\n", pgarr->nr_used);
+	/*
+	 * both checkpoint and restart use 'nr_used', however we only
+	 * collect pages during checkpoint; in restart we simply return
+	 * because pgarr->pages remains NULL.
+	 */
+	if (pgarr->pages) {
+		struct page **pages = pgarr->pages;
+		int nr = pgarr->nr_used;
+
+		while (nr--)
+			page_cache_release(pages[nr]);
+	}
+
+	pgarr->nr_used = 0;
+}
+
+/* free a single page-array object */
+static void cr_pgarr_free_one(struct cr_pgarr *pgarr)
+{
+	cr_pgarr_release_pages(pgarr);
+	kfree(pgarr->pages);
+	kfree(pgarr->vaddrs);
+	kfree(pgarr);
+}
+
+/* free the chains of page-arrays (populated and empty pool) */
+void cr_pgarr_free(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr, *tmp;
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_list, list) {
+		list_del(&pgarr->list);
+		cr_pgarr_free_one(pgarr);
+	}
+
+	list_for_each_entry_safe(pgarr, tmp, &ctx->pgarr_pool, list) {
+		list_del(&pgarr->list);
+		cr_pgarr_free_one(pgarr);
+	}
+}
+
+/* allocate a single page-array object */
+static struct cr_pgarr *cr_pgarr_alloc_one(unsigned long flags)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL);
+	if (!pgarr)
+		return NULL;
+
+	pgarr->vaddrs = kmalloc(CR_PGARR_TOTAL * sizeof(unsigned long),
+				GFP_KERNEL);
+	if (!pgarr->vaddrs)
+		goto nomem;
+
+	/* pgarr->pages is needed only for checkpoint */
+	if (flags & CR_CTX_CKPT) {
+		pgarr->pages = kmalloc(CR_PGARR_TOTAL * sizeof(struct page *),
+				       GFP_KERNEL);
+		if (!pgarr->pages)
+			goto nomem;
+	}
+
+	return pgarr;
+
+ nomem:
+	cr_pgarr_free_one(pgarr);
+	return NULL;
+}
+
+/* cr_pgarr_current - return the next available page-array in the chain
+ * @ctx: checkpoint context
+ *
+ * Returns the first page-array in the chain if it still has space.
+ * Otherwise, takes an empty page-array from the pool (or allocates a new
+ * one) and pushes it to the front of the chain. Returns NULL if out of memory.
+ */
+struct cr_pgarr *cr_pgarr_current(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	pgarr = cr_pgarr_first(ctx);
+	if (pgarr && !cr_pgarr_is_full(pgarr))
+		return pgarr;
+
+	pgarr = cr_pgarr_from_pool(ctx);
+	if (!pgarr)
+		pgarr = cr_pgarr_alloc_one(ctx->flags);
+	if (!pgarr)
+		return NULL;
+
+	list_add(&pgarr->list, &ctx->pgarr_list);
+	return pgarr;
+}
+
+/* reset the page-array chain (dropping page references if necessary) */
+void cr_pgarr_reset_all(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	list_for_each_entry(pgarr, &ctx->pgarr_list, list)
+		cr_pgarr_release_pages(pgarr);
+	list_splice_init(&ctx->pgarr_list, &ctx->pgarr_pool);
+}
+
+/*
+ * Checkpoint is outside the context of the checkpointee, so one cannot
+ * simply read pages from user-space. Instead, we scan the address space
+ * of the target to cherry-pick pages of interest. Selected pages are
+ * enlisted in a page-array chain (attached to the checkpoint context).
+ * To save their contents, each page is mapped to kernel memory and then
+ * dumped to the file descriptor.
+ */
+
+
+/**
+ * cr_consider_private_page - return page pointer for dirty pages
+ * @vma - target vma
+ * @addr - page address
+ *
+ * Looks up the page that corresponds to the address in the vma, and
+ * returns the page if it was modified (and grabs a reference to it),
+ * or otherwise returns NULL (or error).
+ *
+ * This function should _only_ be called for private vma's.
+ */
+static struct page *
+cr_consider_private_page(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct page *page;
+
+	/*
+	 * simplified version of get_user_pages(): already have vma,
+	 * only need FOLL_ANON, and (for now) ignore fault stats.
+	 *
+	 * follow_page() will return NULL if the page is not present
+	 * (swapped), ZERO_PAGE(0) if the pte wasn't allocated, and
+	 * the actual page pointer otherwise.
+	 *
+	 * FIXME: consolidate with get_user_pages()
+	 */
+
+	cond_resched();
+	while (!(page = follow_page(vma, addr, FOLL_ANON | FOLL_GET))) {
+		int ret;
+
+		/* the page is swapped out - bring it in (optimize ?) */
+		ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
+		if (ret & VM_FAULT_ERROR) {
+			if (ret & VM_FAULT_OOM)
+				return ERR_PTR(-ENOMEM);
+			else if (ret & VM_FAULT_SIGBUS)
+				return ERR_PTR(-EFAULT);
+			else
+				BUG();
+			break;
+		}
+		cond_resched();
+	}
+
+	if (IS_ERR(page))
+		return page;
+
+	/*
+	 * Only care about dirty pages: either anonymous non-zero pages,
+	 * or file-backed COW (copy-on-write) pages that were modified.
+	 * A clean COW page is not interesting because its contents are
+	 * identical to the backing file; ignore such pages.
+	 * A file-backed broken COW is identified by its page_mapping()
+	 * being unset (NULL) because the page will no longer be mapped
+	 * to the original file after having been modified.
+	 */
+	if (page == ZERO_PAGE(0)) {
+		/* this is the zero page: ignore */
+		page_cache_release(page);
+		page = NULL;
+	} else if (vma->vm_file && (page_mapping(page) != NULL)) {
+		/* file backed clean cow: ignore */
+		page_cache_release(page);
+		page = NULL;
+	}
+
+	return page;
+}
+
+/**
+ * cr_private_vma_fill_pgarr - fill a page-array with addr/page tuples
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ * @start - start address (updated)
+ *
+ * Returns the number of pages collected
+ */
+static int
+cr_private_vma_fill_pgarr(struct cr_ctx *ctx, struct vm_area_struct *vma,
+			  unsigned long *start)
+{
+	unsigned long end = vma->vm_end;
+	unsigned long addr = *start;
+	struct cr_pgarr *pgarr;
+	int nr_used;
+	int cnt = 0;
+
+	/* this function is only for private memory (anon or file-mapped) */
+	BUG_ON(vma->vm_flags & (VM_SHARED | VM_MAYSHARE));
+
+	do {
+		pgarr = cr_pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+
+		nr_used = pgarr->nr_used;
+
+		while (addr < end) {
+			struct page *page;
+
+			page = cr_consider_private_page(vma, addr);
+			if (IS_ERR(page))
+				return PTR_ERR(page);
+
+			if (page) {
+				pgarr->pages[pgarr->nr_used] = page;
+				pgarr->vaddrs[pgarr->nr_used] = addr;
+				pgarr->nr_used++;
+			}
+
+			addr += PAGE_SIZE;
+
+			if (cr_pgarr_is_full(pgarr))
+				break;
+		}
+
+		cnt += pgarr->nr_used - nr_used;
+
+	} while ((cnt < CR_PGARR_CHUNK) && (addr < end));
+
+	*start = addr;
+	return cnt;
+}
+
+/* dump contents of a page: use kmap_atomic() to avoid TLB flush */
+static int cr_page_write(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(buf, ptr, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return cr_kwrite(ctx, buf, PAGE_SIZE);
+}
+
+/**
+ * cr_vma_dump_pages - dump pages listed in the ctx page-array chain
+ * @ctx - checkpoint context
+ * @total - total number of pages
+ *
+ * First dump all virtual addresses, followed by the contents of all pages
+ */
+static int cr_vma_dump_pages(struct cr_ctx *ctx, int total)
+{
+	struct cr_pgarr *pgarr;
+	void *buf;
+	int i, ret = 0;
+
+	if (!total)
+		return 0;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		ret = cr_kwrite(ctx, pgarr->vaddrs,
+				pgarr->nr_used * sizeof(*pgarr->vaddrs));
+		if (ret < 0)
+			return ret;
+	}
+
+	buf = (void *) __get_free_page(GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		for (i = 0; i < pgarr->nr_used; i++) {
+			ret = cr_page_write(ctx, pgarr->pages[i], buf);
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	free_page((unsigned long) buf);
+	return ret;
+}
+
+/**
+ * cr_write_private_vma_contents - dump contents of a VMA with private memory
+ * @ctx - checkpoint context
+ * @vma - vma to scan
+ *
+ * Collect lists of pages that need to be dumped, and corresponding
+ * virtual addresses into ctx->pgarr_list page-array chain. Then dump
+ * the addresses, followed by the page contents.
+ */
+static int
+cr_write_private_vma_contents(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_pgarr *hh;
+	unsigned long addr = vma->vm_start;
+	int cnt, ret;
+
+	/*
+	 * Work iteratively, collecting and dumping at most CR_PGARR_CHUNK
+	 * in each round. Each iteration is divided into two steps:
+	 *
+	 * (1) scan: scan through the PTEs of the vma to collect the pages
+	 * to dump (later we'll also make them COW), while keeping a list
+	 * of pages and their corresponding addresses on ctx->pgarr_list.
+	 *
+	 * (2) dump: write out a header specifying how many pages, followed
+	 * by the addresses of all pages in ctx->pgarr_list, followed by
+	 * the actual contents of all pages. (Then, release the references
+	 * to the pages and reset the page-array chain).
+	 *
+	 * (This split makes the logic simpler by first counting the pages
+	 * that need saving. More importantly, it allows for a future
+	 * optimization that will reduce application downtime by deferring
+	 * the actual write-out of the data to after the application is
+	 * allowed to resume execution).
+	 *
+	 * After dumping the entire contents, conclude with a header that
+	 * specifies 0 pages to mark the end of the contents.
+	 */
+
+	h.type = CR_HDR_PGARR;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	while (addr < vma->vm_end) {
+		cnt = cr_private_vma_fill_pgarr(ctx, vma, &addr);
+		if (cnt == 0)
+			break;
+		else if (cnt < 0)
+			return cnt;
+
+		hh = cr_hbuf_get(ctx, sizeof(*hh));
+		hh->nr_pages = cnt;
+		ret = cr_write_obj(ctx, &h, hh);
+		cr_hbuf_put(ctx, sizeof(*hh));
+		if (ret < 0)
+			return ret;
+
+		ret = cr_vma_dump_pages(ctx, cnt);
+		if (ret < 0)
+			return ret;
+
+		cr_pgarr_reset_all(ctx);
+	}
+
+	/* mark end of contents with header saying "0" pages */
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	hh->nr_pages = 0;
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	return ret;
+}
+
+static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma)
+{
+	struct cr_hdr h;
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int vma_type, ret;
+
+	h.type = CR_HDR_VMA;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->vm_start = vma->vm_start;
+	hh->vm_end = vma->vm_end;
+	hh->vm_page_prot = pgprot_val(vma->vm_page_prot);
+	hh->vm_flags = vma->vm_flags;
+	hh->vm_pgoff = vma->vm_pgoff;
+
+#define CR_BAD_VM_FLAGS  \
+	(VM_SHARED | VM_MAYSHARE | VM_IO | VM_HUGETLB | VM_NONLINEAR)
+
+	if (vma->vm_flags & CR_BAD_VM_FLAGS) {
+		pr_warning("c/r: unsupported VMA %#lx\n", vma->vm_flags);
+		cr_hbuf_put(ctx, sizeof(*hh));
+		return -ENOSYS;
+	}
+
+	/* by default assume anon memory */
+	vma_type = CR_VMA_ANON;
+
+	/*
+	 * if there is a backing file, assume private-mapped
+	 * (FIXME: check if the file is unlinked)
+	 */
+	if (vma->vm_file)
+		vma_type = CR_VMA_FILE;
+
+	hh->vma_type = vma_type;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	/* save the file name */
+	/* FIXME: files should be deposited and sought in the objhash */
+	if (vma->vm_file) {
+		ret = cr_write_fname(ctx, &vma->vm_file->f_path, &ctx->fs_mnt);
+		if (ret < 0)
+			return ret;
+	}
+
+	return cr_write_private_vma_contents(ctx, vma);
+}
+
+int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int objref, ret;
+
+	h.type = CR_HDR_MM;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	mm = get_task_mm(t);
+
+	objref = 0;	/* will be meaningful with multiple processes */
+	hh->objref = objref;
+
+	down_read(&mm->mmap_sem);
+
+	hh->start_code = mm->start_code;
+	hh->end_code = mm->end_code;
+	hh->start_data = mm->start_data;
+	hh->end_data = mm->end_data;
+	hh->start_brk = mm->start_brk;
+	hh->brk = mm->brk;
+	hh->start_stack = mm->start_stack;
+	hh->arg_start = mm->arg_start;
+	hh->arg_end = mm->arg_end;
+	hh->env_start = mm->env_start;
+	hh->env_end = mm->env_end;
+
+	hh->map_count = mm->map_count;
+
+	/* FIX: need also mm->flags */
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	/* write the vma's */
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		ret = cr_write_vma(ctx, vma);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_write_mm_context(ctx, mm, objref);
+
+ out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 76e2553..b5242fe 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -16,6 +16,8 @@
 #include <linux/capability.h>
 #include <linux/checkpoint.h>
 
+#include "checkpoint_mem.h"
+
 /*
  * Helpers to write(read) from(to) kernel space to(from) the checkpoint
  * image file descriptor (similar to how a core-dump is performed).
@@ -153,7 +155,13 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 {
 	if (ctx->file)
 		fput(ctx->file);
+
 	kfree(ctx->hbuf);
+
+	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
+
+	cr_pgarr_free(ctx);
+
 	kfree(ctx);
 }
 
@@ -168,6 +176,9 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	INIT_LIST_HEAD(&ctx->pgarr_list);
+	INIT_LIST_HEAD(&ctx->pgarr_pool);
+
 	err = -EBADF;
 	ctx->file = fget(fd);
 	if (!ctx->file)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 65a2cbf..f8187ba 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,6 +10,9 @@
  *  distribution for more details.
  */
 
+#include <linux/path.h>
+#include <linux/fs.h>
+
 #define CR_VERSION  1
 
 struct cr_ctx {
@@ -25,6 +28,11 @@ struct cr_ctx {
 
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
+
+	struct list_head pgarr_list;	/* page array to dump VMA contents */
+	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
+
+	struct path fs_mnt;	/* container root (FIXME) */
 };
 
 /* cr_ctx: flags */
@@ -42,6 +50,8 @@ struct cr_hdr;
 extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
 extern int cr_write_buffer(struct cr_ctx *ctx, void *buf, int len);
 extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
+extern int cr_write_fname(struct cr_ctx *ctx,
+			  struct path *path, struct path *root);
 
 extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
 extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int len, int type);
@@ -50,7 +60,10 @@ extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
 extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
 
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
+extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
+extern int cr_read_mm(struct cr_ctx *ctx);
 
 #ifdef pr_fmt
 #undef pr_fmt
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 3efd009..f3997da 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -43,6 +43,7 @@ enum {
 	CR_HDR_HEAD_ARCH,
 	CR_HDR_BUFFER,
 	CR_HDR_STRING,
+	CR_HDR_FNAME,
 
 	CR_HDR_TASK = 101,
 	CR_HDR_THREAD,
@@ -50,6 +51,7 @@ enum {
 
 	CR_HDR_MM = 201,
 	CR_HDR_VMA,
+	CR_HDR_PGARR,
 	CR_HDR_MM_CONTEXT,
 
 	CR_HDR_TAIL = 5001
@@ -84,4 +86,34 @@ struct cr_hdr_task {
 	__s32 task_comm_len;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_mm {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 map_count;
+
+	__u64 start_code, end_code, start_data, end_data;
+	__u64 start_brk, brk, start_stack;
+	__u64 arg_start, arg_end, env_start, env_end;
+} __attribute__((aligned(8)));
+
+/* vma subtypes */
+enum vm_type {
+	CR_VMA_ANON = 1,
+	CR_VMA_FILE
+};
+
+struct cr_hdr_vma {
+	__u32 vma_type;
+	__u32 _padding;
+
+	__u64 vm_start;
+	__u64 vm_end;
+	__u64 vm_page_prot;
+	__u64 vm_flags;
+	__u64 vm_pgoff;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_pgarr {
+	__u64 nr_pages;		/* number of pages saved */
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3
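
For readers who want to picture the stream that cr_write_private_vma_contents()
produces, here is a minimal userspace sketch; it is not part of the patch. It
writes chunks of "count, then all addresses, then all page contents" and ends
with a count of 0. All demo_* names, the tiny DEMO_PAGE_SIZE and the in-memory
image are assumptions made for illustration; the real code wraps each count in
a struct cr_hdr and writes through cr_kwrite().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16	/* tiny stand-in for PAGE_SIZE (assumption) */

/* hypothetical in-memory image; the patch writes to a file descriptor */
struct demo_image {
	unsigned char buf[1024];
	size_t len;
};

/* append n bytes to the image, or fail if it would overflow */
static int demo_write(struct demo_image *img, const void *data, size_t n)
{
	assert(img->len <= sizeof(img->buf));
	if (n > sizeof(img->buf) - img->len)
		return -1;
	memcpy(img->buf + img->len, data, n);
	img->len += n;
	return 0;
}

/*
 * Write 'total' pages in chunks of at most 'chunk' pages. 'pages' is a
 * flat array holding total * DEMO_PAGE_SIZE bytes, in the same order as
 * 'vaddrs'. A final count of 0 marks the end of the contents.
 */
static int demo_write_vma(struct demo_image *img, const uint64_t *vaddrs,
			  const unsigned char *pages, uint64_t total,
			  uint64_t chunk)
{
	uint64_t done = 0, nr, i;

	if (chunk == 0)
		return -1;

	while (done < total) {
		nr = total - done;
		if (nr > chunk)
			nr = chunk;

		/* header: how many pages follow in this chunk */
		if (demo_write(img, &nr, sizeof(nr)) < 0)
			return -1;
		/* first all the addresses ... */
		if (demo_write(img, vaddrs + done, nr * sizeof(*vaddrs)) < 0)
			return -1;
		/* ... then the contents of all pages, in the same order */
		for (i = 0; i < nr; i++) {
			const unsigned char *p;

			p = pages + (done + i) * DEMO_PAGE_SIZE;
			if (demo_write(img, p, DEMO_PAGE_SIZE) < 0)
				return -1;
		}
		done += nr;
	}

	/* mark end of contents with a count of 0 pages */
	nr = 0;
	return demo_write(img, &nr, sizeof(nr));
}
```

Keeping the addresses apart from the contents is what lets the kernel code
write (and later read) a whole vaddrs array with a single call per page-array.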
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
* [RFC v13][PATCH 07/14] Restore memory address space
  2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
                   ` (5 preceding siblings ...)
  2009-01-27 17:08 ` [RFC v13][PATCH 06/14] Dump memory address space Oren Laadan
@ 2009-01-27 17:08 ` Oren Laadan
  2009-01-27 17:08 ` [RFC v13][PATCH 08/14] Infrastructure for shared objects Oren Laadan
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
Restoring the memory address space begins with nuking the existing one
of the current process, and then reading the VMA state and contents.
For each VMA, call do_mmap_pgoff() and then read in the data.
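
As a rough illustration of the "read in the data" step, the loop can be
sketched in userspace as below; this is not the kernel code. It reads a page
count, then that many addresses, then the page contents, and stops at a count
of 0. The demo_* names, the tiny page size, the fixed chunk limit and the
array standing in for the address space are all assumptions; the patch itself
reads a struct cr_hdr_pgarr via cr_read_obj_type() and copies into pages
obtained with get_user_pages().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE	16	/* tiny stand-in for PAGE_SIZE (assumption) */
#define DEMO_NR_PAGES	8	/* size of the pretend address space */
#define DEMO_CHUNK_MAX	4	/* most pages accepted in one chunk */

/* hypothetical cursor over an in-memory checkpoint image */
struct demo_reader {
	const unsigned char *buf;
	size_t len;
	size_t pos;
};

/* copy n bytes out of the image, or fail if the image is too short */
static int demo_read(struct demo_reader *r, void *dst, size_t n)
{
	assert(r->pos <= r->len);
	if (n > r->len - r->pos)
		return -1;
	memcpy(dst, r->buf + r->pos, n);
	r->pos += n;
	return 0;
}

/*
 * Restore the contents of one VMA into 'mem'. Returns the number of
 * pages restored, or -1 if the image is truncated or malformed.
 */
static long demo_restore_vma(struct demo_reader *r,
			     unsigned char mem[DEMO_NR_PAGES][DEMO_PAGE_SIZE])
{
	uint64_t vaddrs[DEMO_CHUNK_MAX];
	uint64_t nr_pages, i;
	long total = 0;

	for (;;) {
		/* header: how many pages follow in this chunk */
		if (demo_read(r, &nr_pages, sizeof(nr_pages)) < 0)
			return -1;
		if (nr_pages == 0)
			return total;	/* a count of 0 ends the contents */
		if (nr_pages > DEMO_CHUNK_MAX)
			return -1;

		/* first all the addresses ... */
		if (demo_read(r, vaddrs, nr_pages * sizeof(vaddrs[0])) < 0)
			return -1;

		/* ... then each page's data, copied to its address */
		for (i = 0; i < nr_pages; i++) {
			uint64_t idx = vaddrs[i] / DEMO_PAGE_SIZE;

			if (vaddrs[i] % DEMO_PAGE_SIZE || idx >= DEMO_NR_PAGES)
				return -1;
			if (demo_read(r, mem[idx], DEMO_PAGE_SIZE) < 0)
				return -1;
		}
		total += (long) nr_pages;
	}
}
```

Unlike this sketch, the kernel side propagates distinct error codes and keeps
the addresses in the page-array chain rather than in a fixed-size buffer.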
Changelog[v13]:
  - Avoid access to hh->vma_type after the header is freed
  - Test for no vma's in exit_mmap() before calling unmap_vma() (or it
    may crash if restart fails after having removed all vma's)
Changelog[v12]:
  - Replace obsolete cr_debug() with pr_debug()
Changelog[v9]:
  - Introduce cr_ctx_checkpoint() for checkpoint-specific ctx setup
Changelog[v7]:
  - Fix argument given to kunmap_atomic() in memory dump/restore
Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
Changelog[v5]:
  - Improve memory restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Memory restore now maps user pages explicitly to copy data into them,
    instead of reading directly to user space; got rid of mprotect_fixup()
Changelog[v4]:
  - Use standard list_... for cr_pgarr
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/checkpoint_hdr.h |    5 +
 arch/x86/mm/restart.c                 |   58 +++++
 checkpoint/Makefile                   |    2 +-
 checkpoint/checkpoint_arch.h          |    2 +
 checkpoint/checkpoint_mem.h           |    5 +
 checkpoint/restart.c                  |   51 +++++
 checkpoint/rstr_mem.c                 |  386 +++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h            |    4 +
 mm/mmap.c                             |    2 +-
 9 files changed, 513 insertions(+), 2 deletions(-)
 create mode 100644 checkpoint/rstr_mem.c
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index eb95705..6185548 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -102,4 +102,9 @@ struct cr_hdr_mm_context {
 	__s16 nldt;
 } __attribute__((aligned(8)));
 
+#ifdef __KERNEL__
+/* misc prototypes from kernel (not defined elsewhere) */
+asmlinkage int sys_modify_ldt(int func, void __user *ptr, unsigned long bytecount);
+#endif
+
 #endif /* __ASM_X86_CKPT_HDR__H */
diff --git a/arch/x86/mm/restart.c b/arch/x86/mm/restart.c
index f5c3f16..a682a1d 100644
--- a/arch/x86/mm/restart.c
+++ b/arch/x86/mm/restart.c
@@ -232,3 +232,61 @@ int cr_read_head_arch(struct cr_ctx *ctx)
 
 	return ret;
 }
+
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm, int rparent)
+{
+	struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int n, parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT);
+	pr_debug("parent %d rparent %d nldt %d\n", parent, rparent, hh->nldt);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+	if (parent != rparent)
+		goto out;
+
+	if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE)
+		goto out;
+
+	/*
+	 * to utilize the syscall modify_ldt() we first convert the data
+	 * in the checkpoint image from 'struct desc_struct' to 'struct
+	 * user_desc' with reverse logic of include/asm/desc.h:fill_ldt()
+	 */
+
+	for (n = 0; n < hh->nldt; n++) {
+		struct user_desc info;
+		struct desc_struct desc;
+		mm_segment_t old_fs;
+
+		ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE);
+		if (ret < 0)
+			goto out;
+
+		info.entry_number = n;
+		info.base_addr = desc.base0 | (desc.base1 << 16);
+		info.limit = desc.limit0;
+		info.seg_32bit = desc.d;
+		info.contents = desc.type >> 2;
+		info.read_exec_only = (desc.type >> 1) ^ 1;
+		info.limit_in_pages = desc.g;
+		info.seg_not_present = desc.p ^ 1;
+		info.useable = desc.avl;
+
+		old_fs = get_fs();
+		set_fs(get_ds());
+		ret = sys_modify_ldt(1, (struct user_desc __user *) &info,
+				     sizeof(info));
+		set_fs(old_fs);
+
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 3a0df6d..ac35033 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
-		ckpt_mem.o
+		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/checkpoint_arch.h b/checkpoint/checkpoint_arch.h
index f06c7eb..39c8224 100644
--- a/checkpoint/checkpoint_arch.h
+++ b/checkpoint/checkpoint_arch.h
@@ -9,3 +9,5 @@ extern int cr_write_mm_context(struct cr_ctx *ctx,
 extern int cr_read_head_arch(struct cr_ctx *ctx);
 extern int cr_read_thread(struct cr_ctx *ctx);
 extern int cr_read_cpu(struct cr_ctx *ctx);
+extern int cr_read_mm_context(struct cr_ctx *ctx,
+			      struct mm_struct *mm, int parent);
diff --git a/checkpoint/checkpoint_mem.h b/checkpoint/checkpoint_mem.h
index 3e48bc4..de1d4c8 100644
--- a/checkpoint/checkpoint_mem.h
+++ b/checkpoint/checkpoint_mem.h
@@ -38,4 +38,9 @@ static inline int cr_pgarr_is_full(struct cr_pgarr *pgarr)
 	return (pgarr->nr_used == CR_PGARR_TOTAL);
 }
 
+static inline int cr_pgarr_nr_free(struct cr_pgarr *pgarr)
+{
+	return CR_PGARR_TOTAL - pgarr->nr_used;
+}
+
 #endif /* _CHECKPOINT_CKPT_MEM_H_ */
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index f40b619..536d017 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -120,6 +120,44 @@ int cr_read_string(struct cr_ctx *ctx, char *str, int len)
 	return ret;
 }
 
+/**
+ * cr_read_fname - read a file name
+ * @ctx: checkpoint context
+ * @fname: buffer
+ * @flen: buffer length
+ */
+int cr_read_fname(struct cr_ctx *ctx, char *fname, int flen)
+{
+	return cr_read_buf_type(ctx, fname, &flen, CR_HDR_FNAME);
+}
+
+/**
+ * cr_read_open_fname - read a file name and open a file
+ * @ctx: checkpoint context
+ * @flags: file flags
+ * @mode: file mode
+ */
+struct file *cr_read_open_fname(struct cr_ctx *ctx, int flags, int mode)
+{
+	struct file *file;
+	char *fname;
+	int ret;
+
+	fname = kmalloc(PATH_MAX, GFP_KERNEL);
+	if (!fname)
+		return ERR_PTR(-ENOMEM);
+
+	ret = cr_read_fname(ctx, fname, PATH_MAX);
+	pr_debug("fname '%s' flags %#x mode %#x\n", fname, flags, mode);
+	if (ret >= 0)
+		file = filp_open(fname, flags, mode);
+	else
+		file = ERR_PTR(ret);
+
+	kfree(fname);
+	return file;
+}
+
 /* read the checkpoint header */
 static int cr_read_head(struct cr_ctx *ctx)
 {
@@ -219,6 +257,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	pr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_mm(ctx);
+	pr_debug("memory: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	pr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
@@ -230,10 +272,19 @@ static int cr_read_task(struct cr_ctx *ctx)
 	return ret;
 }
 
+/* setup restart-specific parts of ctx */
+static int cr_ctx_restart(struct cr_ctx *ctx)
+{
+	return 0;
+}
+
 int do_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_ctx_restart(ctx);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_head(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
new file mode 100644
index 0000000..4d5ce1a
--- /dev/null
+++ b/checkpoint/rstr_mem.c
@@ -0,0 +1,386 @@
+/*
+ *  Restart memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fcntl.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+#include <linux/mman.h>
+#include <linux/mm.h>
+#include <linux/err.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_arch.h"
+#include "checkpoint_mem.h"
+
+/*
+ * Unlike checkpoint, restart is executed in the context of each restarting
+ * process: vma regions are restored via a call to mmap(), and the data is
+ * read into the address space of the current process.
+ */
+
+/**
+ * cr_read_pages_vaddrs - read addresses of pages to page-array chain
+ * @ctx - restart context
+ * @nr_pages - number of addresses to read
+ */
+static int cr_read_pages_vaddrs(struct cr_ctx *ctx, unsigned long nr_pages)
+{
+	struct cr_pgarr *pgarr;
+	unsigned long *vaddrp;
+	int nr, ret;
+
+	while (nr_pages) {
+		pgarr = cr_pgarr_current(ctx);
+		if (!pgarr)
+			return -ENOMEM;
+		nr = cr_pgarr_nr_free(pgarr);
+		if (nr > nr_pages)
+			nr = nr_pages;
+		vaddrp = &pgarr->vaddrs[pgarr->nr_used];
+		ret = cr_kread(ctx, vaddrp, nr * sizeof(unsigned long));
+		if (ret < 0)
+			return ret;
+		pgarr->nr_used += nr;
+		nr_pages -= nr;
+	}
+	return 0;
+}
+
+static int cr_page_read(struct cr_ctx *ctx, struct page *page, char *buf)
+{
+	void *ptr;
+	int ret;
+
+	ret = cr_kread(ctx, buf, PAGE_SIZE);
+	if (ret < 0)
+		return ret;
+
+	ptr = kmap_atomic(page, KM_USER1);
+	memcpy(ptr, buf, PAGE_SIZE);
+	kunmap_atomic(ptr, KM_USER1);
+
+	return 0;
+}
+
+/**
+ * cr_read_pages_contents - read in data of pages in page-array chain
+ * @ctx - restart context
+ */
+static int cr_read_pages_contents(struct cr_ctx *ctx)
+{
+	struct mm_struct *mm = current->mm;
+	struct cr_pgarr *pgarr;
+	unsigned long *vaddrs;
+	char *buf;
+	int i, ret = 0;
+
+	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	down_read(&mm->mmap_sem);
+	list_for_each_entry_reverse(pgarr, &ctx->pgarr_list, list) {
+		vaddrs = pgarr->vaddrs;
+		for (i = 0; i < pgarr->nr_used; i++) {
+			struct page *page;
+
+			ret = get_user_pages(current, mm, vaddrs[i],
+					     1, 1, 1, &page, NULL);
+			if (ret < 0)
+				goto out;
+
+			ret = cr_page_read(ctx, page, buf);
+			page_cache_release(page);
+
+			if (ret < 0)
+				goto out;
+		}
+	}
+
+ out:
+	up_read(&mm->mmap_sem);
+	kfree(buf);
+	return ret;
+}
+
+/**
+ * cr_read_private_vma_contents - restore contents of a VMA with private memory
+ * @ctx - restart context
+ *
+ * Reads a header that specifies how many pages will follow, then reads
+ * a list of virtual addresses into ctx->pgarr_list page-array chain,
+ * followed by the actual contents of the corresponding pages. Iterates
+ * these steps until reaching a header specifying "0" pages, which marks
+ * the end of the contents.
+ */
+static int cr_read_private_vma_contents(struct cr_ctx *ctx)
+{
+	struct cr_hdr_pgarr *hh;
+	unsigned long nr_pages;
+	int parent, ret = 0;
+
+	while (1) {
+		hh = cr_hbuf_get(ctx, sizeof(*hh));
+		parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_PGARR);
+		if (parent != 0) {
+			if (parent < 0)
+				ret = parent;
+			else
+				ret = -EINVAL;
+			cr_hbuf_put(ctx, sizeof(*hh));
+			break;
+		}
+
+		pr_debug("nr_pages %ld\n", (unsigned long) hh->nr_pages);
+
+		nr_pages = hh->nr_pages;
+		cr_hbuf_put(ctx, sizeof(*hh));
+
+		if (!nr_pages)
+			break;
+
+		ret = cr_read_pages_vaddrs(ctx, nr_pages);
+		if (ret < 0)
+			break;
+		ret = cr_read_pages_contents(ctx);
+		if (ret < 0)
+			break;
+		cr_pgarr_reset_all(ctx);
+	}
+
+	return ret;
+}
+
+/**
+ * cr_calc_map_prot_bits - convert vm_flags to mmap protection
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_prot = 0;
+
+	if (orig_vm_flags & VM_READ)
+		vm_prot |= PROT_READ;
+	if (orig_vm_flags & VM_WRITE)
+		vm_prot |= PROT_WRITE;
+	if (orig_vm_flags & VM_EXEC)
+		vm_prot |= PROT_EXEC;
+	if (orig_vm_flags & PROT_SEM)   /* only (?) with IPC-SHM  */
+		vm_prot |= PROT_SEM;
+
+	return vm_prot;
+}
+
+/**
+ * cr_calc_map_flags_bits - convert vm_flags to mmap flags
+ * @orig_vm_flags: source vm_flags
+ */
+static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags)
+{
+	unsigned long vm_flags = 0;
+
+	vm_flags = MAP_FIXED;
+	if (orig_vm_flags & VM_GROWSDOWN)
+		vm_flags |= MAP_GROWSDOWN;
+	if (orig_vm_flags & VM_DENYWRITE)
+		vm_flags |= MAP_DENYWRITE;
+	if (orig_vm_flags & VM_EXECUTABLE)
+		vm_flags |= MAP_EXECUTABLE;
+	if (orig_vm_flags & VM_MAYSHARE)
+		vm_flags |= MAP_SHARED;
+	else
+		vm_flags |= MAP_PRIVATE;
+
+	return vm_flags;
+}
+
+static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm)
+{
+	struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	unsigned long vm_size, vm_start, vm_flags, vm_prot, vm_pgoff;
+	unsigned long addr;
+	enum vm_type vma_type;
+	struct file *file = NULL;
+	int parent, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA);
+	if (parent < 0) {
+		ret = parent;
+		goto err;
+	} else if (parent != 0)
+		goto err;
+
+	pr_debug("vma %#lx-%#lx type %d\n", (unsigned long) hh->vm_start,
+		 (unsigned long) hh->vm_end, (int) hh->vma_type);
+
+	if (hh->vm_end < hh->vm_start)
+		goto err;
+
+	vm_start = hh->vm_start;
+	vm_pgoff = hh->vm_pgoff;
+	vm_size = hh->vm_end - hh->vm_start;
+	vm_prot = cr_calc_map_prot_bits(hh->vm_flags);
+	vm_flags = cr_calc_map_flags_bits(hh->vm_flags);
+	vma_type = hh->vma_type;
+
+	cr_hbuf_put(ctx, sizeof(*hh));
+
+	switch (vma_type) {
+
+	case CR_VMA_ANON:		/* anonymous private mapping */
+		if (vm_flags & MAP_SHARED)
+			goto err;
+		/*
+		 * vm_pgoff for anonymous mapping is the "global" page
+		 * offset (namely from addr 0x0), so we force a zero
+		 */
+		vm_pgoff = 0;
+		break;
+
+	case CR_VMA_FILE:		/* private mapping from a file */
+		if (vm_flags & MAP_SHARED)
+			goto err;
+		/*
+		 * for private mapping using 'read-only' is sufficient
+		 */
+		file = cr_read_open_fname(ctx, O_RDONLY, 0);
+		if (IS_ERR(file)) {
+			ret = PTR_ERR(file);
+			goto err;
+		}
+		break;
+
+	default:
+		goto err;
+
+	}
+
+
+	down_write(&mm->mmap_sem);
+	addr = do_mmap_pgoff(file, vm_start, vm_size,
+			     vm_prot, vm_flags, vm_pgoff);
+	up_write(&mm->mmap_sem);
+	pr_debug("size %#lx prot %#lx flag %#lx pgoff %#lx => %#lx\n",
+		 vm_size, vm_prot, vm_flags, vm_pgoff, addr);
+
+	/* the file (if opened) is now referenced by the vma */
+	if (file)
+		filp_close(file, NULL);
+
+	if (IS_ERR((void *) addr))
+		return PTR_ERR((void *) addr);
+
+	/*
+	 * CR_VMA_ANON: read in memory as is
+	 * CR_VMA_FILE: read in memory as is
+	 * (more to follow ...)
+	 */
+
+	switch (vma_type) {
+	case CR_VMA_ANON:
+	case CR_VMA_FILE:
+		/* standard case: read the data into the memory */
+		ret = cr_read_private_vma_contents(ctx);
+		break;
+	}
+
+	if (ret < 0)
+		return ret;
+
+	pr_debug("vma retval %d\n", ret);
+	return 0;
+
+ err:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+static int cr_destroy_mm(struct mm_struct *mm)
+{
+	struct vm_area_struct *vmnext = mm->mmap;
+	struct vm_area_struct *vma;
+	int ret;
+
+	while (vmnext) {
+		vma = vmnext;
+		vmnext = vmnext->vm_next;
+		ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start);
+		if (ret < 0) {
+			pr_debug("c/r: restart failed do_munmap (%d)\n", ret);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+int cr_read_mm(struct cr_ctx *ctx)
+{
+	struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct mm_struct *mm;
+	int nr, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(current))
+		goto out;
+#endif
+	pr_debug("map_count %d\n", hh->map_count);
+
+	/* XXX need more sanity checks */
+	if (hh->start_code > hh->end_code ||
+	    hh->start_data > hh->end_data || hh->map_count < 0)
+		goto out;
+
+	mm = current->mm;
+
+	/* point of no return -- destruct current mm */
+	down_write(&mm->mmap_sem);
+	ret = cr_destroy_mm(mm);
+	if (ret < 0) {
+		up_write(&mm->mmap_sem);
+		goto out;
+	}
+	mm->start_code = hh->start_code;
+	mm->end_code = hh->end_code;
+	mm->start_data = hh->start_data;
+	mm->end_data = hh->end_data;
+	mm->start_brk = hh->start_brk;
+	mm->brk = hh->brk;
+	mm->start_stack = hh->start_stack;
+	mm->arg_start = hh->arg_start;
+	mm->arg_end = hh->arg_end;
+	mm->env_start = hh->env_start;
+	mm->env_end = hh->env_end;
+	up_write(&mm->mmap_sem);
+
+	/* FIX: need also mm->flags */
+
+	for (nr = hh->map_count; nr; nr--) {
+		ret = cr_read_vma(ctx, mm);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = cr_read_mm_context(ctx, mm, hh->objref);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index f8187ba..06b6e5a 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -59,6 +59,10 @@ extern int cr_read_buf_type(struct cr_ctx *ctx, void *buf, int *len, int type);
 extern int cr_read_buffer(struct cr_ctx *ctx, void *buf, int *len);
 extern int cr_read_string(struct cr_ctx *ctx, char *str, int len);
 
+extern int cr_read_fname(struct cr_ctx *ctx, char *fname, int n);
+extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
+				       int flags, int mode);
+
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
 
diff --git a/mm/mmap.c b/mm/mmap.c
index d4855a6..b98fb66 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2105,7 +2105,7 @@ void exit_mmap(struct mm_struct *mm)
 	tlb = tlb_gather_mmu(mm, 1);
 	/* Don't update_hiwater_rss(mm) here, do_exit already did */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
+	end = vma ? unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL) : 0;
 	vm_unacct_memory(nr_accounted);
 	free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
-- 
1.5.4.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
^ permalink raw reply related	[flat|nested] 121+ messages in thread
* [RFC v13][PATCH 08/14] Infrastructure for shared objects
  2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
                   ` (6 preceding siblings ...)
  2009-01-27 17:08 ` [RFC v13][PATCH 07/14] Restore " Oren Laadan
@ 2009-01-27 17:08 ` Oren Laadan
  2009-01-27 17:08 ` [RFC v13][PATCH 09/14] Dump open file descriptors Oren Laadan
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
Infrastructure to handle objects that may be shared and referenced by
multiple tasks or other objects, e.g. open files, memory address space
etc.
The state of shared objects is saved once. On the first encounter, the
state is dumped and the object is assigned a unique identifier (objref)
and also stored in a hash table (indexed by its physical kernel address).
From then on the object will be found in the hash and only its identifier
is saved.
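The checkpoint-side rule above can be illustrated with a minimal userspace
sketch. It is not the kernel code from this patch: the sketch_* names, the
16-bucket table and the multiplicative hash are assumptions made for the
example (the patch uses hlist buckets and hash_long()). It only shows the
"first encounter assigns an objref, later encounters reuse it" behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_NBITS   4			/* hypothetical table size */
#define SKETCH_BUCKETS (1u << SKETCH_NBITS)

struct sketch_obj {
	int objref;			/* identifier written to the image */
	const void *ptr;		/* hash key during checkpoint */
	struct sketch_obj *next;	/* bucket chain */
};

struct sketch_hash {
	struct sketch_obj *bucket[SKETCH_BUCKETS];
	int next_free_objref;
};

static void sketch_hash_init(struct sketch_hash *h)
{
	size_t i;

	for (i = 0; i < SKETCH_BUCKETS; i++)
		h->bucket[i] = NULL;
	h->next_free_objref = 1;	/* 0 means "no objref assigned yet" */
}

/* Stand-in for hash_long(): multiplicative hash of the pointer value. */
static unsigned int sketch_hash_ptr(const void *ptr)
{
	uint64_t v = (uint64_t)(uintptr_t)ptr;

	return (unsigned int)((v * UINT64_C(0x9E3779B97F4A7C15))
			      >> (64 - SKETCH_NBITS));
}

/* Returns 0 if @ptr was already known, 1 if newly added, -1 on failure. */
static int sketch_add_ptr(struct sketch_hash *h, const void *ptr, int *objref)
{
	unsigned int i = sketch_hash_ptr(ptr);
	struct sketch_obj *obj;

	assert(objref != NULL);
	for (obj = h->bucket[i]; obj; obj = obj->next) {
		if (obj->ptr == ptr) {
			*objref = obj->objref;	/* seen before: reuse id */
			return 0;
		}
	}

	obj = malloc(sizeof(*obj));
	if (!obj)
		return -1;
	obj->ptr = ptr;
	obj->objref = h->next_free_objref++;	/* first encounter: new id */
	obj->next = h->bucket[i];
	h->bucket[i] = obj;
	*objref = obj->objref;
	return 1;
}

/* Entries are only ever released all at once, when the table is discarded. */
static void sketch_hash_free(struct sketch_hash *h)
{
	size_t i;

	for (i = 0; i < SKETCH_BUCKETS; i++) {
		struct sketch_obj *obj = h->bucket[i];

		while (obj) {
			struct sketch_obj *next = obj->next;

			free(obj);
			obj = next;
		}
		h->bucket[i] = NULL;
	}
}
```

A caller would dump the object's full state only when sketch_add_ptr()
returns 1, and write just the objref otherwise.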
On restart the identifier is looked up in the hash table; if not found
then the state is read, the object is created, and added to the hash
table (this time indexed by its identifier). Otherwise, the object in
the hash table is used.
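The restart-side lookup might be sketched in userspace as follows. Again
this is only an illustration under assumptions: the rsketch_* names are
hypothetical, a modulo stands in for hash_long(), and malloc(1) stands in
for "read the state and create the object".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RSKETCH_BUCKETS 16u		/* hypothetical table size */

struct rsketch_obj {
	int objref;			/* hash key during restart */
	void *ptr;			/* the re-created object */
	struct rsketch_obj *next;
};

struct rsketch_hash {
	struct rsketch_obj *bucket[RSKETCH_BUCKETS];
};

static void *rsketch_get_by_ref(struct rsketch_hash *h, int objref)
{
	struct rsketch_obj *obj;

	for (obj = h->bucket[(unsigned int)objref % RSKETCH_BUCKETS];
	     obj; obj = obj->next)
		if (obj->objref == objref)
			return obj->ptr;
	return NULL;
}

static int rsketch_add_ref(struct rsketch_hash *h, void *ptr, int objref)
{
	unsigned int i = (unsigned int)objref % RSKETCH_BUCKETS;
	struct rsketch_obj *obj = malloc(sizeof(*obj));

	if (!obj)
		return -1;
	obj->objref = objref;
	obj->ptr = ptr;
	obj->next = h->bucket[i];
	h->bucket[i] = obj;
	return 0;
}

/*
 * Return the object for @objref, creating and registering it on the
 * first occurrence. *@created tells the caller which case happened.
 */
static void *rsketch_restore(struct rsketch_hash *h, int objref, int *created)
{
	void *ptr = rsketch_get_by_ref(h, objref);

	assert(created != NULL);
	*created = 0;
	if (ptr)
		return ptr;		/* already restored: share it */

	ptr = malloc(1);		/* stand-in for reading the state */
	if (!ptr)
		return NULL;
	if (rsketch_add_ref(h, ptr, objref) < 0) {
		free(ptr);
		return NULL;
	}
	*created = 1;
	return ptr;
}

static void rsketch_hash_free(struct rsketch_hash *h)
{
	size_t i;

	for (i = 0; i < RSKETCH_BUCKETS; i++) {
		struct rsketch_obj *obj = h->bucket[i];

		while (obj) {
			struct rsketch_obj *next = obj->next;

			free(obj->ptr);
			free(obj);
			obj = next;
		}
		h->bucket[i] = NULL;
	}
}
```

Two lookups of the same objref return the same pointer, which is what lets
several restored descriptors or tasks end up sharing one object.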
The hash is "one-way": objects added to it are never deleted until the
hash is discarded. The hash is discarded at the end of checkpoint or
restart, whether successful or not.
The hash keeps a reference to every object that is added to it, matching
the object's type, and maintains this reference during its lifetime.
Therefore, it is always safe to use an object that is stored in the hash.
Changelog[v13]:
  - Use hash_long() with 'unsigned long' cast to support 64bit archs
    (Nathan Lynch <ntl@pobox.com>)
Changelog[v11]:
  - Doc: be explicit about grabbing a reference and object lifetime
Changelog[v4]:
  - Fix calculation of hash table size
Changelog[v3]:
  - Use standard hlist_... for hash table
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 checkpoint/Makefile        |    2 +-
 checkpoint/objhash.c       |  280 ++++++++++++++++++++++++++++++++++++++++++++
 checkpoint/sys.c           |    4 +
 include/linux/checkpoint.h |   20 +++
 4 files changed, 305 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/objhash.c
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index ac35033..9843fb9 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -2,5 +2,5 @@
 # Makefile for linux checkpoint/restart.
 #
 
-obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o \
+obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
 		ckpt_mem.o rstr_mem.o
diff --git a/checkpoint/objhash.c b/checkpoint/objhash.c
new file mode 100644
index 0000000..ee31b38
--- /dev/null
+++ b/checkpoint/objhash.c
@@ -0,0 +1,280 @@
+/*
+ *  Checkpoint-restart - object hash infrastructure to manage shared objects
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/file.h>
+#include <linux/hash.h>
+#include <linux/checkpoint.h>
+
+struct cr_objref {
+	int objref;
+	void *ptr;
+	unsigned short type;
+	unsigned short flags;
+	struct hlist_node hash;
+};
+
+struct cr_objhash {
+	struct hlist_head *head;
+	int next_free_objref;
+};
+
+#define CR_OBJHASH_NBITS  10
+#define CR_OBJHASH_TOTAL  (1UL << CR_OBJHASH_NBITS)
+
+static void cr_obj_ref_drop(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		fput((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_obj_ref_grab(struct cr_objref *obj)
+{
+	switch (obj->type) {
+	case CR_OBJ_FILE:
+		get_file((struct file *) obj->ptr);
+		break;
+	default:
+		BUG();
+	}
+}
+
+static void cr_objhash_clear(struct cr_objhash *objhash)
+{
+	struct hlist_head *h = objhash->head;
+	struct hlist_node *n, *t;
+	struct cr_objref *obj;
+	int i;
+
+	for (i = 0; i < CR_OBJHASH_TOTAL; i++) {
+		hlist_for_each_entry_safe(obj, n, t, &h[i], hash) {
+			cr_obj_ref_drop(obj);
+			kfree(obj);
+		}
+	}
+}
+
+void cr_objhash_free(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash = ctx->objhash;
+
+	if (objhash) {
+		cr_objhash_clear(objhash);
+		kfree(objhash->head);
+		kfree(ctx->objhash);
+		ctx->objhash = NULL;
+	}
+}
+
+int cr_objhash_alloc(struct cr_ctx *ctx)
+{
+	struct cr_objhash *objhash;
+	struct hlist_head *head;
+
+	objhash = kzalloc(sizeof(*objhash), GFP_KERNEL);
+	if (!objhash)
+		return -ENOMEM;
+	head = kzalloc(CR_OBJHASH_TOTAL * sizeof(*head), GFP_KERNEL);
+	if (!head) {
+		kfree(objhash);
+		return -ENOMEM;
+	}
+
+	objhash->head = head;
+	objhash->next_free_objref = 1;
+
+	ctx->objhash = objhash;
+	return 0;
+}
+
+static struct cr_objref *cr_obj_find_by_ptr(struct cr_ctx *ctx, void *ptr)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_long((unsigned long) ptr,
+					  CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->ptr == ptr)
+			return obj;
+	return NULL;
+}
+
+static struct cr_objref *cr_obj_find_by_objref(struct cr_ctx *ctx, int objref)
+{
+	struct hlist_head *h;
+	struct hlist_node *n;
+	struct cr_objref *obj;
+
+	h = &ctx->objhash->head[hash_long((unsigned long) objref,
+					  CR_OBJHASH_NBITS)];
+	hlist_for_each_entry(obj, n, h, hash)
+		if (obj->objref == objref)
+			return obj;
+	return NULL;
+}
+
+/**
+ * cr_obj_new - allocate an object and add to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object reference
+ * @type: object type
+ * @flags: object flags
+ *
+ * Allocate an object referring to @ptr and add to the hash table.
+ * If @objref is zero, assign a unique object reference and use @ptr
+ * as a hash key [checkpoint]. Else use @objref as a key [restart].
+ * In both cases, grab a reference (depending on @type) to said object.
+ */
+static struct cr_objref *cr_obj_new(struct cr_ctx *ctx, void *ptr, int objref,
+				    unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+	int i;
+
+	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return NULL;
+
+	obj->ptr = ptr;
+	obj->type = type;
+	obj->flags = flags;
+
+	if (objref) {
+		/* use @objref to index (restart) */
+		obj->objref = objref;
+		i = hash_long((unsigned long) objref, CR_OBJHASH_NBITS);
+	} else {
+		/* use @ptr to index, assign objref (checkpoint) */
+		obj->objref = ctx->objhash->next_free_objref++;
+		i = hash_long((unsigned long) ptr, CR_OBJHASH_NBITS);
+	}
+
+	hlist_add_head(&obj->hash, &ctx->objhash->head[i]);
+	cr_obj_ref_grab(obj);
+	return obj;
+}
+
+/**
+ * cr_obj_add_ptr - add an object to the hash table if not already there
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique object reference [output]
+ * @type: object type
+ * @flags: object flags
+ *
+ * Look up the object pointed to by @ptr in the hash table. If it isn't
+ * already found there, then add the object to the table, and allocate a
+ * fresh unique object reference (objref). Grab a reference to every
+ * object that is added, and maintain the reference until the entire
+ * hash is freed.
+ *
+ * Fills the unique objref of the object into @objref.
+ *
+ * [This is used during checkpoint].
+ *
+ * Returns 0 if found, 1 if added, < 0 on error
+ */
+int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+	int ret = 0;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (!obj) {
+		obj = cr_obj_new(ctx, ptr, 0, type, flags);
+		if (!obj)
+			return -ENOMEM;
+		else
+			ret = 1;
+	} else if (obj->type != type)	/* sanity check */
+		return -EINVAL;
+	*objref = obj->objref;
+	return ret;
+}
+
+/**
+ * cr_obj_add_ref - add an object with unique objref to the hash table
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @objref: unique identifier - object reference
+ * @type: object type
+ * @flags: object flags
+ *
+ * Add the object pointed to by @ptr and identified by unique object
+ * reference given by @objref to the hash table (indexed by @objref).
+ * Grab a reference to every object that is added, and maintain the
+ * reference until the entire hash is freed.
+ *
+ * [This is used during restart].
+ */
+int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+		   unsigned short type, unsigned short flags)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_new(ctx, ptr, objref, type, flags);
+	return obj ? 0 : -ENOMEM;
+}
+
+/**
+ * cr_obj_get_by_ptr - find the unique object reference of an object
+ * @ctx: checkpoint context
+ * @ptr: pointer to object
+ * @type: object type
+ *
+ * Look up the objref of the object pointed to by @ptr and return it,
+ * or -ESRCH if not found, or -EINVAL if its type differs from @type.
+ *
+ * [This is used during checkpoint].
+ */
+int cr_obj_get_by_ptr(struct cr_ctx *ctx, void *ptr, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_ptr(ctx, ptr);
+	if (!obj)
+		return -ESRCH;
+	if (obj->type != type)
+		return -EINVAL;
+	return obj->objref;
+}
+
+/**
+ * cr_obj_get_by_ref - find an object given its unique object reference
+ * @ctx: checkpoint context
+ * @objref: unique identifier - object reference
+ * @type: object type
+ *
+ * Look up the object identified by the unique object reference @objref
+ * and return a pointer to it, or NULL if not found, or ERR_PTR(-EINVAL)
+ * if its type differs from @type.
+ *
+ * [This is used during restart].
+ */
+void *cr_obj_get_by_ref(struct cr_ctx *ctx, int objref, unsigned short type)
+{
+	struct cr_objref *obj;
+
+	obj = cr_obj_find_by_objref(ctx, objref);
+	if (!obj)
+		return NULL;
+	if (obj->type != type)
+		return ERR_PTR(-EINVAL);
+	return obj->ptr;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index b5242fe..a506b3a 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -161,6 +161,7 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	path_put(&ctx->fs_mnt);		/* safe with NULL pointers */
 
 	cr_pgarr_free(ctx);
+	cr_objhash_free(ctx);
 
 	kfree(ctx);
 }
@@ -189,6 +190,9 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	if (!ctx->hbuf)
 		goto err;
 
+	if (cr_objhash_alloc(ctx) < 0)
+		goto err;
+
 	return ctx;
 
  err:
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 06b6e5a..0ad4940 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -29,6 +29,8 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct cr_objhash *objhash;	/* hash for shared objects */
+
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
 
@@ -45,6 +47,24 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+/* shared objects handling */
+
+enum {
+	CR_OBJ_FILE = 1,
+	CR_OBJ_MAX
+};
+
+extern void cr_objhash_free(struct cr_ctx *ctx);
+extern int cr_objhash_alloc(struct cr_ctx *ctx);
+extern void *cr_obj_get_by_ref(struct cr_ctx *ctx,
+			       int objref, unsigned short type);
+extern int cr_obj_get_by_ptr(struct cr_ctx *ctx,
+			     void *ptr, unsigned short type);
+extern int cr_obj_add_ptr(struct cr_ctx *ctx, void *ptr, int *objref,
+			  unsigned short type, unsigned short flags);
+extern int cr_obj_add_ref(struct cr_ctx *ctx, void *ptr, int objref,
+			  unsigned short type, unsigned short flags);
+
 struct cr_hdr;
 
 extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
-- 
1.5.4.3
* [RFC v13][PATCH 09/14] Dump open file descriptors
  2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
                   ` (7 preceding siblings ...)
  2009-01-27 17:08 ` [RFC v13][PATCH 08/14] Infrastructure for shared objects Oren Laadan
@ 2009-01-27 17:08 ` Oren Laadan
  2009-01-27 17:08 ` [RFC v13][PATCH 11/14] External checkpoint of a task other than ourself Oren Laadan
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
Dump the files_struct of a task with 'struct cr_hdr_files', followed by
all open file descriptors. Because the 'struct file' corresponding to an
FD can be shared, each is assigned an objref and registered in the
object hash. A reference to the 'file *' is kept for as long as it lives
in the hash (the hash is only cleaned up at the end of the checkpoint).
For each open FD there is a 'struct cr_hdr_fd_ent' with the FD, its
close-on-exec property, and the objref of the corresponding 'file *'.
If the FD is to be saved (first time) then this is followed by a
'struct cr_hdr_fd_data' with the FD state. Then will come the next FD
and so on.
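The record ordering described above can be sketched as a small userspace
function. The dsk_* names and the in-memory record array are assumptions
for illustration only (the patch writes 'struct cr_hdr_fd_ent' and
'struct cr_hdr_fd_data' to the image instead), and a linear scan stands in
for the object hash.

```c
#include <assert.h>
#include <stddef.h>

enum dsk_type { DSK_FD_ENT = 1, DSK_FD_DATA = 2 };

struct dsk_fd {
	int fd;			/* descriptor number */
	const void *file;	/* opaque file object behind the fd */
	int objref;		/* filled in by dsk_dump_fds() */
};

struct dsk_record {
	enum dsk_type type;
	int fd;
	int objref;
};

/*
 * Emit one FD_ENT record per descriptor, followed by an FD_DATA record
 * only the first time a given file object is seen. Returns the number
 * of records written, or -1 if @out is too small.
 */
static int dsk_dump_fds(struct dsk_fd *tab, size_t nfds,
			struct dsk_record *out, size_t max_out)
{
	size_t i, j, n = 0;
	int next_objref = 1;

	assert(tab != NULL && out != NULL);
	for (i = 0; i < nfds; i++) {
		int is_new = 1;

		/* linear scan stands in for the object hash lookup */
		for (j = 0; j < i; j++) {
			if (tab[j].file == tab[i].file) {
				tab[i].objref = tab[j].objref;
				is_new = 0;
				break;
			}
		}
		if (is_new)
			tab[i].objref = next_objref++;

		if (n + 1 + (size_t)is_new > max_out)
			return -1;

		out[n].type = DSK_FD_ENT;
		out[n].fd = tab[i].fd;
		out[n].objref = tab[i].objref;
		n++;

		if (is_new) {
			out[n].type = DSK_FD_DATA;
			out[n].fd = tab[i].fd;
			out[n].objref = tab[i].objref;
			n++;
		}
	}
	return (int)n;
}
```

With fds 0 and 2 sharing one file and fd 1 using another, this yields five
records: ENT, DATA, ENT, DATA, ENT, where the last ENT carries the objref
already assigned for fd 0.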
Recall that it is assumed that all tasks possibly sharing the file table
are frozen. If this assumption breaks, then the behavior is *undefined*:
checkpoint may fail, or restart from the resulting image file will fail.
This patch only handles basic FDs - regular files, directories.
Changelog[v12]:
  - Replace obsolete cr_debug() with pr_debug()
Changelog[v11]:
  - Discard handling of opened symlinks (there is no such thing)
  - cr_scan_fds() retries from scratch if hits size limits
Changelog[v9]:
  - Fix a couple of leaks in cr_write_files()
  - Drop useless kfree from cr_scan_fds()
Changelog[v8]:
  - initialize 'coe' to workaround gcc false warning
Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
 arch/x86/include/asm/checkpoint_hdr.h |    2 +-
 checkpoint/Makefile                   |    2 +-
 checkpoint/checkpoint.c               |    4 +
 checkpoint/checkpoint_file.h          |   17 +++
 checkpoint/ckpt_file.c                |  224 +++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h            |    3 +-
 include/linux/checkpoint_hdr.h        |   31 +++++-
 7 files changed, 279 insertions(+), 4 deletions(-)
 create mode 100644 checkpoint/checkpoint_file.h
 create mode 100644 checkpoint/ckpt_file.c
diff --git a/arch/x86/include/asm/checkpoint_hdr.h b/arch/x86/include/asm/checkpoint_hdr.h
index 6185548..43f21e4 100644
--- a/arch/x86/include/asm/checkpoint_hdr.h
+++ b/arch/x86/include/asm/checkpoint_hdr.h
@@ -15,7 +15,7 @@
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
  * keep data 64-bit aligned: use padding for structure members, and use
- * __attribute__ ((aligned (8))) for the entire structure.
+ * __attribute__((aligned (8))) for the entire structure.
  *
  * Quoting Arnd Bergmann:
  *   "This structure has an odd multiple of 32-bit members, which means
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 9843fb9..7496695 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 5c47184..dd0f527 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -233,6 +233,10 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	pr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_write_files(ctx, t);
+	pr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_write_thread(ctx, t);
 	pr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/checkpoint_file.h b/checkpoint/checkpoint_file.h
new file mode 100644
index 0000000..9dc3eba
--- /dev/null
+++ b/checkpoint/checkpoint_file.h
@@ -0,0 +1,17 @@
+#ifndef _CHECKPOINT_CKPT_FILE_H_
+#define _CHECKPOINT_CKPT_FILE_H_
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/fdtable.h>
+
+int cr_scan_fds(struct files_struct *files, int **fdtable);
+
+#endif /* _CHECKPOINT_CKPT_FILE_H_ */
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
new file mode 100644
index 0000000..e3097ac
--- /dev/null
+++ b/checkpoint/ckpt_file.c
@@ -0,0 +1,224 @@
+/*
+ *  Checkpoint file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_file.h"
+
+#define CR_DEFAULT_FDTABLE  256		/* an initial guess */
+
+/**
+ * cr_scan_fds - scan file table and construct array of open fds
+ * @files: files_struct pointer
+ * @fdtable: (output) array of open fds
+ *
+ * Returns the number of open fds found, and also the file table
+ * array via *fdtable. The caller should free the array.
+ *
+ * The caller must validate the file descriptors collected in the
+ * array before using them, e.g. by using fcheck_files(), in case
+ * the task's fdtable changes in the meantime.
+ */
+int cr_scan_fds(struct files_struct *files, int **fdtable)
+{
+	struct fdtable *fdt;
+	int *fds = NULL;
+	int i, n;
+	int tot = CR_DEFAULT_FDTABLE;
+
+	/*
+	 * We assume that all tasks possibly sharing the file table are
+	 * frozen (or we are a single process and we checkpoint ourselves).
+	 * Therefore, we can safely drop the lock, krealloc() and rescan
+	 * from scratch. Otherwise the file table may be modified by another
+	 * task after we scan it. The behavior in this case is undefined,
+	 * and either checkpoint or restart will likely fail.
+	 */
+ retry:
+	fds = krealloc(fds, tot * sizeof(*fds), GFP_KERNEL);
+	if (!fds)
+		return -ENOMEM;
+
+	spin_lock(&files->file_lock);
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	for (n = 0, i = 0; i < fdt->max_fds; i++) {
+		if (!fcheck_files(files, i))
+			continue;
+		if (n == tot) {
+			rcu_read_unlock();
+			spin_unlock(&files->file_lock);
+			tot *= 2;	/* won't overflow: kmalloc will fail */
+			goto retry;
+		}
+		fds[n++] = i;
+	}
+	rcu_read_unlock();
+	spin_unlock(&files->file_lock);
+
+	*fdtable = fds;
+	return n;
+}
+
+/* cr_write_fd_data - dump the state of a given file pointer */
+static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct dentry *dent = file->f_dentry;
+	struct inode *inode = dent->d_inode;
+	enum fd_type fd_type;
+	int ret;
+
+	h.type = CR_HDR_FD_DATA;
+	h.len = sizeof(*hh);
+	h.parent = parent;
+
+	hh->f_flags = file->f_flags;
+	hh->f_mode = file->f_mode;
+	hh->f_pos = file->f_pos;
+	hh->f_version = file->f_version;
+	/* FIX: need also file->uid, file->gid, file->f_owner, etc */
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+		fd_type = CR_FD_FILE;
+		break;
+	case S_IFDIR:
+		fd_type = CR_FD_DIR;
+		break;
+	default:
+		cr_hbuf_put(ctx, sizeof(*hh));
+		return -EBADF;
+	}
+
+	/* FIX: check if the file/dir/link is unlinked */
+	hh->fd_type = fd_type;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	return cr_write_fname(ctx, &file->f_path, &ctx->fs_mnt);
+}
+
+/**
+ * cr_write_fd_ent - dump the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @fd: file descriptor
+ *
+ * Saves the state of the file descriptor; looks up the actual file
+ * pointer in the hash table, and if found saves the matching objref,
+ * otherwise calls cr_write_fd_data to dump the file pointer too.
+ */
+static int
+cr_write_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int fd)
+{
+	struct cr_hdr h;
+	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	struct fdtable *fdt;
+	int objref, new, ret;
+	int coe = 0;	/* avoid gcc warning */
+
+	rcu_read_lock();
+	fdt = files_fdtable(files);
+	file = fcheck_files(files, fd);
+	if (file) {
+		coe = FD_ISSET(fd, fdt->close_on_exec);
+		get_file(file);
+	}
+	rcu_read_unlock();
+
+	/* sanity check (although this shouldn't happen) */
+	if (!file) {
+		ret = -EBADF;
+		goto out;
+	}
+
+	/* adding 'file' to the hash will keep a reference to it */
+	new = cr_obj_add_ptr(ctx, file, &objref, CR_OBJ_FILE, 0);
+	pr_debug("fd %d objref %d file %p c-o-e %d\n", fd, objref, file, coe);
+
+	if (new < 0) {
+		ret = new;
+		goto out;
+	}
+
+	h.type = CR_HDR_FD_ENT;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh->objref = objref;
+	hh->fd = fd;
+	hh->close_on_exec = coe;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	if (ret < 0)
+		goto out;
+
+	/* new==1 if-and-only-if file was newly added to hash */
+	if (new)
+		ret = cr_write_fd_data(ctx, file, objref);
+
+out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (file)
+		fput(file);
+	return ret;
+}
+
+int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct files_struct *files;
+	int *fdtable = NULL;
+	int nfds, n, ret;
+
+	h.type = CR_HDR_FILES;
+	h.len = sizeof(*hh);
+	h.parent = task_pid_vnr(t);
+
+	files = get_files_struct(t);
+
+	nfds = cr_scan_fds(files, &fdtable);
+	if (nfds < 0) {
+		ret = nfds;
+		goto out;
+	}
+
+	hh->objref = 0;	/* will be meaningful with multiple processes */
+	hh->nfds = nfds;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		goto out;
+
+	pr_debug("nfds %d\n", nfds);
+	for (n = 0; n < nfds; n++) {
+		ret = cr_write_fd_ent(ctx, files, fdtable[n]);
+		if (ret < 0)
+			break;
+	}
+
+ out:
+	kfree(fdtable);
+	put_files_struct(files);
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 0ad4940..59cc515 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,7 +13,7 @@
 #include <linux/path.h>
 #include <linux/fs.h>
 
-#define CR_VERSION  1
+#define CR_VERSION  2
 
 struct cr_ctx {
 	int crid;		/* unique checkpoint id */
@@ -85,6 +85,7 @@ extern struct file *cr_read_open_fname(struct cr_ctx *ctx,
 
 extern int do_checkpoint(struct cr_ctx *ctx, pid_t pid);
 extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t);
+extern int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);
 
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
 extern int cr_read_mm(struct cr_ctx *ctx);
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index f3997da..cf6a637 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -17,7 +17,7 @@
 /*
  * To maintain compatibility between 32-bit and 64-bit architecture flavors,
  * keep data 64-bit aligned: use padding for structure members, and use
- * __attribute__ ((aligned (8))) for the entire structure.
+ * __attribute__((aligned (8))) for the entire structure.
  *
  * Quoting Arnd Bergmann:
  *   "This structure has an odd multiple of 32-bit members, which means
@@ -54,6 +54,10 @@ enum {
 	CR_HDR_PGARR,
 	CR_HDR_MM_CONTEXT,
 
+	CR_HDR_FILES = 301,
+	CR_HDR_FD_ENT,
+	CR_HDR_FD_DATA,
+
 	CR_HDR_TAIL = 5001
 };
 
@@ -116,4 +120,29 @@ struct cr_hdr_pgarr {
 	__u64 nr_pages;		/* number of pages to saved */
 } __attribute__((aligned(8)));
 
+struct cr_hdr_files {
+	__u32 objref;		/* identifier for shared objects */
+	__u32 nfds;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_fd_ent {
+	__u32 objref;		/* identifier for shared objects */
+	__s32 fd;
+	__u32 close_on_exec;
+} __attribute__((aligned(8)));
+
+/* fd types */
+enum  fd_type {
+	CR_FD_FILE = 1,
+	CR_FD_DIR,
+};
+
+struct cr_hdr_fd_data {
+	__u16 fd_type;
+	__u16 f_mode;
+	__u32 f_flags;
+	__u64 f_pos;
+	__u64 f_version;
+} __attribute__((aligned(8)));
+
 #endif /* _CHECKPOINT_CKPT_HDR_H_ */
-- 
1.5.4.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [RFC v13][PATCH 10/14] Restore open file descriptors
       [not found] ` <1233076092-8660-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-01-27 17:08   ` Oren Laadan
  2009-01-27 17:08   ` [RFC v13][PATCH 12/14] Track in-kernel when we expect checkpoint/restart to work Oren Laadan
  2009-01-27 17:08   ` [RFC v13][PATCH 14/14] Restart multiple processes Oren Laadan
  2 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
Restore open file descriptors: for each FD, read 'struct cr_hdr_fd_ent'
and look up its objref in the hash table. If it is not found (first
occurrence), read in 'struct cr_hdr_fd_data', create a new FD and register
the file in the hash. Otherwise attach the file pointer from the hash as
an FD.
This patch only handles basic FDs - regular files, directories and also
symbolic links.
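The lookup-or-create step described above can be sketched in userspace C.
This is only an illustration under stated assumptions, not the code in this
patch: restore_fd(), struct objtable and MAX_OBJREF are hypothetical names,
and a plain array stands in for the kernel object hash. The first time an
objref is seen the file is opened; later entries with the same objref are
attached to that same open file with dup2(), so they share the file position.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define MAX_OBJREF 64	/* hypothetical cap, for this sketch only */

/* stand-in for the object hash: objref -> fd that already holds the file */
struct objtable {
	int seen[MAX_OBJREF];
	int fd[MAX_OBJREF];
};

/*
 * Restore one descriptor entry at 'want_fd'.  Open 'path' on the first
 * occurrence of 'objref'; otherwise share the file opened earlier.
 * Returns want_fd on success, -1 on error.
 */
static int restore_fd(struct objtable *t, int objref, const char *path,
		      int flags, int want_fd)
{
	int fd;

	assert(t != NULL);
	if (objref <= 0 || objref >= MAX_OBJREF || want_fd < 0)
		return -1;

	if (t->seen[objref])
		/* seen before: attach the same open file to want_fd */
		return dup2(t->fd[objref], want_fd) < 0 ? -1 : want_fd;

	/* first occurrence: open the file, then move it into place */
	fd = open(path, flags);
	if (fd < 0)
		return -1;
	if (fd != want_fd) {
		if (dup2(fd, want_fd) < 0) {
			close(fd);
			return -1;
		}
		close(fd);
	}
	t->seen[objref] = 1;
	t->fd[objref] = want_fd;
	return want_fd;
}
```

Two entries that carry the same objref end up as two descriptors for one
open file, while an entry with a different objref gets its own open file
even if the path is the same.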
Changelog[v12]:
  - Replace obsolete cr_debug() with pr_debug()
Changelog[v6]:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
---
 checkpoint/Makefile        |    2 +-
 checkpoint/restart.c       |    4 +
 checkpoint/rstr_file.c     |  248 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/checkpoint.h |    1 +
 4 files changed, 254 insertions(+), 1 deletions(-)
 create mode 100644 checkpoint/rstr_file.c
diff --git a/checkpoint/Makefile b/checkpoint/Makefile
index 7496695..88bbc10 100644
--- a/checkpoint/Makefile
+++ b/checkpoint/Makefile
@@ -3,4 +3,4 @@
 #
 
 obj-$(CONFIG_CHECKPOINT_RESTART) += sys.o checkpoint.o restart.o objhash.o \
-		ckpt_mem.o rstr_mem.o ckpt_file.o
+		ckpt_mem.o rstr_mem.o ckpt_file.o rstr_file.o
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 536d017..ece05b7 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -261,6 +261,10 @@ static int cr_read_task(struct cr_ctx *ctx)
 	pr_debug("memory: ret %d\n", ret);
 	if (ret < 0)
 		goto out;
+	ret = cr_read_files(ctx);
+	pr_debug("files: ret %d\n", ret);
+	if (ret < 0)
+		goto out;
 	ret = cr_read_thread(ctx);
 	pr_debug("thread: ret %d\n", ret);
 	if (ret < 0)
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
new file mode 100644
index 0000000..f44b081
--- /dev/null
+++ b/checkpoint/rstr_file.c
@@ -0,0 +1,248 @@
+/*
+ *  Restart file descriptors
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/fsnotify.h>
+#include <linux/syscalls.h>
+#include <linux/checkpoint.h>
+#include <linux/checkpoint_hdr.h>
+
+#include "checkpoint_file.h"
+
+static int cr_close_all_fds(struct files_struct *files)
+{
+	int *fdtable;
+	int nfds;
+
+	nfds = cr_scan_fds(files, &fdtable);
+	if (nfds < 0)
+		return nfds;
+	while (nfds--)
+		sys_close(fdtable[nfds]);
+	kfree(fdtable);
+	return 0;
+}
+
+/**
+ * cr_attach_file - attach a lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		fsnotify_open(file->f_path.dentry);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+/**
+ * cr_attach_get_file - attach (and get) lonely file ptr to a file descriptor
+ * @file: lonely file pointer
+ */
+static int cr_attach_get_file(struct file *file)
+{
+	int fd = get_unused_fd_flags(0);
+
+	if (fd >= 0) {
+		fsnotify_open(file->f_path.dentry);
+		get_file(file);
+		fd_install(fd, file);
+	}
+	return fd;
+}
+
+#define CR_SETFL_MASK (O_APPEND|O_NONBLOCK|O_NDELAY|FASYNC|O_DIRECT|O_NOATIME)
+
+/* cr_read_fd_data - restore the state of a given file pointer */
+static int
+cr_read_fd_data(struct cr_ctx *ctx, struct files_struct *files, int rparent)
+{
+	struct cr_hdr_fd_data *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	int parent, ret;
+	int fd = 0;	/* pacify gcc warning */
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_DATA);
+	pr_debug("rparent %d parent %d flags %#x mode %#x how %d\n",
+		 rparent, parent, hh->f_flags, hh->f_mode, hh->fd_type);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+	if (parent != rparent)
+		goto out;
+
+	/* FIX: more sanity checks on f_flags, f_mode etc */
+
+	switch (hh->fd_type) {
+	case CR_FD_FILE:
+	case CR_FD_DIR:
+		file = cr_read_open_fname(ctx, hh->f_flags, hh->f_mode);
+		break;
+	default:
+		goto out;
+	}
+
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	/* FIX: need to restore uid, gid, owner etc */
+
+	/* adding <objref,file> to the hash will keep a reference to it */
+	ret = cr_obj_add_ref(ctx, file, parent, CR_OBJ_FILE, 0);
+	if (ret < 0) {
+		filp_close(file, NULL);
+		goto out;
+	}
+
+	fd = cr_attach_file(file);	/* no need to cleanup 'file' below */
+	if (fd < 0) {
+		ret = fd;
+		filp_close(file, NULL);
+		goto out;
+	}
+
+	ret = sys_fcntl(fd, F_SETFL, hh->f_flags & CR_SETFL_MASK);
+	if (ret < 0)
+		goto out;
+	ret = vfs_llseek(file, hh->f_pos, SEEK_SET);
+	if (ret == -ESPIPE)	/* ignore error on non-seekable files */
+		ret = 0;
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret < 0 ? ret : fd;
+}
+
+/**
+ * cr_read_fd_ent - restore the state of a given file descriptor
+ * @ctx: checkpoint context
+ * @files: files_struct pointer
+ * @parent: parent objref
+ *
+ * Restores the state of a file descriptor; looks up the objref (in the
+ * header) in the hash table, and if found picks the matching file and
+ * use it; otherwise calls cr_read_fd_data to restore the file too.
+ */
+static int
+cr_read_fd_ent(struct cr_ctx *ctx, struct files_struct *files, int rparent)
+{
+	struct cr_hdr_fd_ent *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct file *file;
+	int newfd, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FD_ENT);
+	pr_debug("rparent %d parent %d ref %d fd %d c.o.e %d\n",
+		 rparent, parent, hh->objref, hh->fd, hh->close_on_exec);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+
+	if (parent != rparent)
+		goto out;
+	if (hh->objref <= 0)
+		goto out;
+
+	file = cr_obj_get_by_ref(ctx, hh->objref, CR_OBJ_FILE);
+	if (IS_ERR(file)) {
+		ret = PTR_ERR(file);
+		goto out;
+	}
+
+	if (file) {
+		/* reuse file descriptor found in the hash table */
+		newfd = cr_attach_get_file(file);
+	} else {
+		/* create new file pointer (and register in hash table) */
+		newfd = cr_read_fd_data(ctx, files, hh->objref);
+	}
+
+	if (newfd < 0) {
+		ret = newfd;
+		goto out;
+	}
+
+	pr_debug("newfd got %d wanted %d\n", newfd, hh->fd);
+
+	/* if newfd isn't desired fd then reposition it */
+	if (newfd != hh->fd) {
+		ret = sys_dup2(newfd, hh->fd);
+		if (ret < 0)
+			goto out;
+		sys_close(newfd);
+	}
+
+	if (hh->close_on_exec)
+		set_close_on_exec(hh->fd, 1);
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+int cr_read_files(struct cr_ctx *ctx)
+{
+	struct cr_hdr_files *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	struct files_struct *files = current->files;
+	int i, parent, ret;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_FILES);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	}
+
+	ret = -EINVAL;
+#if 0	/* activate when containers are used */
+	if (parent != task_pid_vnr(current))
+		goto out;
+#endif
+	pr_debug("objref %d nfds %d\n", hh->objref, hh->nfds);
+	if (hh->objref < 0 || hh->nfds < 0)
+		goto out;
+
+	if (hh->nfds > sysctl_nr_open) {
+		ret = -EMFILE;
+		goto out;
+	}
+
+	/* point of no return -- close all file descriptors */
+	ret = cr_close_all_fds(files);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < hh->nfds; i++) {
+		ret = cr_read_fd_ent(ctx, files, hh->objref);
+		if (ret < 0)
+			goto out;
+	}
+
+	ret = 0;
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 59cc515..ea9ab4c 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -89,6 +89,7 @@ extern int cr_write_files(struct cr_ctx *ctx, struct task_struct *t);
 
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
 extern int cr_read_mm(struct cr_ctx *ctx);
+extern int cr_read_files(struct cr_ctx *ctx);
 
 #ifdef pr_fmt
 #undef pr_fmt
-- 
1.5.4.3
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* [RFC v13][PATCH 11/14] External checkpoint of a task other than ourself
  2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
                   ` (8 preceding siblings ...)
  2009-01-27 17:08 ` [RFC v13][PATCH 09/14] Dump open file descriptors Oren Laadan
@ 2009-01-27 17:08 ` Oren Laadan
  2009-01-27 17:08 ` [RFC v13][PATCH 13/14] Checkpoint multiple processes Oren Laadan
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
Now we can do "external" checkpoint, i.e. act on another task.
sys_checkpoint() now looks up the target pid (in our namespace) and
checkpoints the corresponding task. That task should be the root of
a container.
sys_restart() remains the same, as the restart is always done in the
context of the restarting task.
Changelog[v12]:
  - Replace obsolete cr_debug() with pr_debug()
Changelog[v11]:
  - Copy contents of 'init->fs->root' instead of pointing to them
Changelog[v10]:
  - Grab vfs root of container init, rather than current process
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 checkpoint/checkpoint.c    |   72 ++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/restart.c       |    4 +-
 checkpoint/sys.c           |    6 ++++
 include/linux/checkpoint.h |    2 +
 4 files changed, 80 insertions(+), 4 deletions(-)
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index dd0f527..e0af8a2 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -10,6 +10,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/ptrace.h>
 #include <linux/time.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -225,6 +226,13 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
+	/* TODO: verity that the task is frozen (unless self) */
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("c/r: task may not be in state TASK_DEAD\n");
+		return -EAGAIN;
+	}
+
 	ret = cr_write_task_struct(ctx, t);
 	pr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
@@ -247,22 +255,82 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
+{
+	struct task_struct *task = NULL;
+	struct nsproxy *nsproxy = NULL;
+	int err = -ESRCH;
+
+	ctx->root_pid = pid;
+
+	read_lock(&tasklist_lock);
+	task = find_task_by_vpid(pid);
+	if (task)
+		get_task_struct(task);
+	read_unlock(&tasklist_lock);
+
+	if (!task)
+		goto out;
+
+#if 0	/* enable to use containers */
+	if (!is_container_init(task)) {
+		err = -EINVAL;
+		goto out;
+	}
+#endif
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	rcu_read_lock();
+	if (task_nsproxy(task)) {
+		nsproxy = task_nsproxy(task);
+		get_nsproxy(nsproxy);
+	}
+	rcu_read_unlock();
+
+	if (!nsproxy)
+		goto out;
+
+	/* TODO: verify that the container is frozen */
+
+	ctx->root_task = task;
+	ctx->root_nsproxy = nsproxy;
+
+	return 0;
+
+ out:
+	if (task)
+		put_task_struct(task);
+	return err;
+}
+
+/* setup checkpoint-specific parts of ctx */
 static int cr_ctx_checkpoint(struct cr_ctx *ctx, pid_t pid)
 {
 	struct fs_struct *fs;
+	int ret;
 
 	ctx->root_pid = pid;
 
+	ret = cr_get_container(ctx, pid);
+	if (ret < 0)
+		return ret;
+
 	/*
 	 * assume checkpointer is in container's root vfs
 	 * FIXME: this works for now, but will change with real containers
 	 */
 
-	fs = current->fs;
+	task_lock(ctx->root_task);
+	fs = ctx->root_task->fs;
 	read_lock(&fs->lock);
 	ctx->fs_mnt = fs->root;
 	path_get(&ctx->fs_mnt);
 	read_unlock(&fs->lock);
+	task_unlock(ctx->root_task);
 
 	return 0;
 }
@@ -277,7 +345,7 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_write_task(ctx, current);
+	ret = cr_write_task(ctx, ctx->root_task);
 	if (ret < 0)
 		goto out;
 	ret = cr_write_tail(ctx);
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index ece05b7..0c46abf 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -277,7 +277,7 @@ static int cr_read_task(struct cr_ctx *ctx)
 }
 
 /* setup restart-specific parts of ctx */
-static int cr_ctx_restart(struct cr_ctx *ctx)
+static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	return 0;
 }
@@ -286,7 +286,7 @@ int do_restart(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
-	ret = cr_ctx_restart(ctx);
+	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
 	ret = cr_read_head(ctx);
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index a506b3a..4a51ed3 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -9,6 +9,7 @@
  */
 
 #include <linux/sched.h>
+#include <linux/nsproxy.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
 #include <linux/file.h>
@@ -163,6 +164,11 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	cr_pgarr_free(ctx);
 	cr_objhash_free(ctx);
 
+	if (ctx->root_nsproxy)
+		put_nsproxy(ctx->root_nsproxy);
+	if (ctx->root_task)
+		put_task_struct(ctx->root_task);
+
 	kfree(ctx);
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index ea9ab4c..cf54f47 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -19,6 +19,8 @@ struct cr_ctx {
 	int crid;		/* unique checkpoint id */
 
 	pid_t root_pid;		/* container identifier */
+	struct task_struct *root_task;	/* container root task */
+	struct nsproxy *root_nsproxy;	/* container root nsproxy */
 
 	unsigned long flags;
 	unsigned long oflags;	/* restart: old flags */
-- 
1.5.4.3
* [RFC v13][PATCH 12/14] Track in-kernel when we expect checkpoint/restart to work
       [not found] ` <1233076092-8660-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-01-27 17:08   ` [RFC v13][PATCH 10/14] Restore open file descriptors Oren Laadan
@ 2009-01-27 17:08   ` Oren Laadan
  2009-01-27 17:08   ` [RFC v13][PATCH 14/14] Restart multiple processes Oren Laadan
  2 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
From: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Suggested by Ingo.
Getting checkpoint/restart working is going to be a long effort.
For a long time there will be a lot of things that we know just don't
work.  That doesn't mean that it will be useless; it just means that
there are some complicated features that we will have to fix
incrementally.
This patch introduces a new mechanism to help the checkpoint/restart
developers.  A new function pair is created:
task_deny_checkpointing() and process_deny_checkpointing().  When
called, these tell the kernel that we *know* that the process has
performed some activity that will keep it from being properly
checkpointed.
The 'flag' is an atomic_t for now so that we can have some level
of atomicity and make sure to only warn once.
For now, this is a one-way trip.  Once a process is no longer
'may_checkpoint' capable, neither it nor its children ever will be.
This can, of course, be fixed up in the future.  We might want to
reset the flag when a new pid namespace is created, for instance.
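The one-way, warn-once behaviour described above can be sketched in
userspace C11. This is an illustration only, with hypothetical names
(struct task, task_deny(), task_fork()). Note the swapped-in technique:
the patch itself uses atomic_dec_and_test(), while this sketch uses
atomic_exchange() so that the flag simply stays at 0 once it is cleared.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* hypothetical stand-in for the relevant part of task_struct */
struct task {
	atomic_int may_checkpoint;	/* 1 = still checkpointable */
};

/*
 * Clear the flag.  Returns 1 if this call is the one that cleared it
 * (and therefore printed the warning), 0 if it was already cleared.
 */
static int task_deny(struct task *t, const char *file, int line)
{
	assert(t != NULL);
	if (atomic_exchange(&t->may_checkpoint, 0) == 0)
		return 0;	/* already denied: stay quiet */
	fprintf(stderr, "process performed an action that can not be "
			"checkpointed at: %s:%d\n", file, line);
	return 1;
}

/* a child starts with whatever its parent currently has */
static void task_fork(struct task *child, struct task *parent)
{
	assert(child != NULL && parent != NULL);
	atomic_init(&child->may_checkpoint,
		    atomic_load(&parent->may_checkpoint));
}
```

A caller would typically wrap task_deny() in a macro that passes __FILE__
and __LINE__, as the patch does for its own helper.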
Signed-off-by: Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 checkpoint/checkpoint.c    |    6 ++++++
 include/linux/checkpoint.h |   33 ++++++++++++++++++++++++++++++++-
 include/linux/sched.h      |    4 ++++
 kernel/fork.c              |   10 ++++++++++
 4 files changed, 52 insertions(+), 1 deletions(-)
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index e0af8a2..35e3c6b 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -233,6 +233,12 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 		return -EAGAIN;
 	}
 
+	if (!atomic_read(&t->may_checkpoint)) {
+		pr_warning("c/r: task %d may not checkpoint\n",
+			   task_pid_vnr(t));
+		return -EBUSY;
+	}
+
 	ret = cr_write_task_struct(ctx, t);
 	pr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index cf54f47..e867b95 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -10,8 +10,11 @@
  *  distribution for more details.
  */
 
-#include <linux/path.h>
 #include <linux/fs.h>
+#include <linux/path.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_CHECKPOINT_RESTART
 
 #define CR_VERSION  2
 
@@ -99,4 +102,32 @@ extern int cr_read_files(struct cr_ctx *ctx);
 
 #define pr_fmt(fmt) "[%d:c/r:%s] " fmt, task_pid_vnr(current), __func__
 
+static inline void __task_deny_checkpointing(struct task_struct *task,
+		char *file, int line)
+{
+	if (!atomic_dec_and_test(&task->may_checkpoint))
+		return;
+	printk(KERN_INFO "process performed an action that can not be "
+			"checkpointed at: %s:%d\n", file, line);
+	WARN_ON(1);
+}
+#define process_deny_checkpointing(p)  \
+	__task_deny_checkpointing(p, __FILE__, __LINE__)
+
+/*
+ * For now, we're not going to have a distinction between
+ * tasks and processes for the purpose of c/r.  But, allow
+ * these two calls anyway to make new users at least think
+ * about it.
+ */
+#define task_deny_checkpointing(p)  \
+	__task_deny_checkpointing(p, __FILE__, __LINE__)
+
+#else
+
+static inline void task_deny_checkpointing(struct task_struct *task) {}
+static inline void process_deny_checkpointing(struct task_struct *task) {}
+
+#endif
+
 #endif /* _CHECKPOINT_CKPT_H_ */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55e30d1..faa2ec6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1356,6 +1356,10 @@ struct task_struct {
 	unsigned long default_timer_slack_ns;
 
 	struct list_head	*scm_work_list;
+
+#ifdef CONFIG_CHECKPOINT_RESTART
+	atomic_t may_checkpoint;
+#endif
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 495da2e..085ce56 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -196,6 +196,13 @@ void __init fork_init(unsigned long mempages)
 	init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
 	init_task.signal->rlim[RLIMIT_SIGPENDING] =
 		init_task.signal->rlim[RLIMIT_NPROC];
+
+#ifdef CONFIG_CHECKPOINT_RESTART
+	/*
+	 * This probably won't stay set for long...
+	 */
+	atomic_set(&init_task.may_checkpoint, 1);
+#endif
 }
 
 int __attribute__((weak)) arch_dup_task_struct(struct task_struct *dst,
@@ -246,6 +253,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
 	tsk->btrace_seq = 0;
 #endif
 	tsk->splice_pipe = NULL;
+#ifdef CONFIG_CHECKPOINT_RESTART
+	atomic_set(&tsk->may_checkpoint, atomic_read(&orig->may_checkpoint));
+#endif
 	return tsk;
 
 out:
-- 
1.5.4.3
* [RFC v13][PATCH 13/14] Checkpoint multiple processes
  2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
                   ` (9 preceding siblings ...)
  2009-01-27 17:08 ` [RFC v13][PATCH 11/14] External checkpoint of a task other than ourself Oren Laadan
@ 2009-01-27 17:08 ` Oren Laadan
       [not found] ` <1233076092-8660-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-02-10 17:05 ` [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Dave Hansen
  12 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, containers, linux-kernel, linux-mm, linux-api,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
Checkpointing of multiple processes works by recording the tasks tree
structure below a given task (usually this task is the container init).
For a given task, do a DFS scan of the tasks tree and collect them
into an array (keeping a reference to each task). Using DFS simplifies
the recreation of tasks either in user space or kernel space. For each
task collected, test if it can be checkpointed, and save its pid, tgid,
and ppid.
The actual work is divided into two passes: a first scan counts the
tasks, then memory is allocated and a second scan fills the array.
The logic is suitable for creation of processes during restart either
in userspace or by the kernel.
Currently we ignore threads and zombies, as well as session ids.
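The two-pass scan described above can be sketched in userspace C. This is
an illustration under stated assumptions, not the kernel code: struct node,
tree_scan() and tree_collect() are hypothetical names, and simple
parent/child/sibling pointers stand in for the task lists. The same
iterative pre-order DFS runs twice: once with no array to count the nodes,
and once to fill an array of exactly that size, failing if the tree no
longer fits.

```c
#include <assert.h>
#include <stdlib.h>

/* hypothetical stand-in for a task in the process tree */
struct node {
	int pid;
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
};

/*
 * Pre-order DFS below 'root'.  With arr == NULL only count the nodes;
 * otherwise also store them, returning -1 if more than 'cap' are found.
 */
static int tree_scan(struct node *root, struct node **arr, int cap)
{
	struct node *n = root;
	int nr = 0;

	assert(root != NULL);
	for (;;) {
		if (arr) {
			if (nr == cap)
				return -1;	/* tree grew between passes */
			arr[nr] = n;
		}
		nr++;

		/* descend to the first child if there is one */
		if (n->first_child) {
			n = n->first_child;
			continue;
		}
		/* otherwise climb until a sibling is available */
		while (n != root && !n->next_sibling)
			n = n->parent;
		if (n == root)
			break;
		n = n->next_sibling;
	}
	return nr;
}

/* pass 1 counts, pass 2 fills; returns a malloc'ed array or NULL */
static struct node **tree_collect(struct node *root, int *nr_out)
{
	int n = tree_scan(root, NULL, 0);
	struct node **arr = calloc(n, sizeof(*arr));

	if (!arr)
		return NULL;
	if (tree_scan(root, arr, n) != n) {
		free(arr);
		return NULL;
	}
	*nr_out = n;
	return arr;
}
```

Because the array is in pre-order, every parent appears before its
children, which is the property that makes recreating the tasks simple.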
Changelog[v13]:
  - Release tasklist_lock in error path in cr_tree_count_tasks()
  - Use separate index for 'tasks_arr' and 'hh' in cr_write_pids()
Changelog[v12]:
  - Replace obsolete cr_debug() with pr_debug()
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge Hallyn <serue@us.ibm.com>
---
 checkpoint/checkpoint.c        |  231 +++++++++++++++++++++++++++++++++++++---
 checkpoint/sys.c               |   16 +++
 include/linux/checkpoint.h     |    3 +
 include/linux/checkpoint_hdr.h |   13 ++-
 4 files changed, 246 insertions(+), 17 deletions(-)
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 35e3c6b..64155de 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -226,19 +226,6 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 {
 	int ret;
 
-	/* TODO: verity that the task is frozen (unless self) */
-
-	if (t->state == TASK_DEAD) {
-		pr_warning("c/r: task may not be in state TASK_DEAD\n");
-		return -EAGAIN;
-	}
-
-	if (!atomic_read(&t->may_checkpoint)) {
-		pr_warning("c/r: task %d may not checkpoint\n",
-			   task_pid_vnr(t));
-		return -EBUSY;
-	}
-
 	ret = cr_write_task_struct(ctx, t);
 	pr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
@@ -261,6 +248,208 @@ static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 	return ret;
 }
 
+/* dump all tasks in ctx->tasks_arr[] */
+static int cr_write_all_tasks(struct cr_ctx *ctx)
+{
+	int n, ret = 0;
+
+	for (n = 0; n < ctx->tasks_nr; n++) {
+		pr_debug("dumping task #%d\n", n);
+		ret = cr_write_task(ctx, ctx->tasks_arr[n]);
+		if (ret < 0)
+			break;
+	}
+
+	return ret;
+}
+
+static int cr_may_checkpoint_task(struct task_struct *t, struct cr_ctx *ctx)
+{
+	pr_debug("check %d\n", task_pid_nr_ns(t, ctx->root_nsproxy->pid_ns));
+
+	if (t->state == TASK_DEAD) {
+		pr_warning("c/r: task %d is TASK_DEAD\n", task_pid_vnr(t));
+		return -EAGAIN;
+	}
+
+	if (!atomic_read(&t->may_checkpoint)) {
+		pr_warning("c/r: task %d uncheckpointable\n", task_pid_vnr(t));
+		return -EBUSY;
+	}
+
+	if (!ptrace_may_access(t, PTRACE_MODE_READ))
+		return -EPERM;
+
+	/* FIXME: verify that the task is frozen (unless self) */
+
+	/* FIXME: change this for nested containers */
+	if (task_nsproxy(t) != ctx->root_nsproxy)
+		return -EPERM;
+
+	return 0;
+}
+
+#define CR_HDR_PIDS_CHUNK	256
+
+static int cr_write_pids(struct cr_ctx *ctx)
+{
+	struct cr_hdr_pids *hh;
+	struct pid_namespace *ns;
+	struct task_struct *task;
+	struct task_struct **tasks_arr;
+	int tasks_nr, n, pos = 0, ret = 0;
+
+	ns = ctx->root_nsproxy->pid_ns;
+	tasks_arr = ctx->tasks_arr;
+	tasks_nr = ctx->tasks_nr;
+	BUG_ON(tasks_nr <= 0);
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh) * CR_HDR_PIDS_CHUNK);
+
+	do {
+		rcu_read_lock();
+		for (n = 0; n < min(tasks_nr, CR_HDR_PIDS_CHUNK); n++) {
+			task = tasks_arr[pos];
+
+			/* is this task cool ? */
+			ret = cr_may_checkpoint_task(task, ctx);
+			if (ret < 0) {
+				rcu_read_unlock();
+				goto out;
+			}
+			hh[n].vpid = task_pid_nr_ns(task, ns);
+			hh[n].vtgid = task_tgid_nr_ns(task, ns);
+			hh[n].vppid = task_tgid_nr_ns(task->real_parent, ns);
+			pr_debug("task[%d]: vpid %d vtgid %d parent %d\n", pos,
+				 hh[n].vpid, hh[n].vtgid, hh[n].vppid);
+			pos++;
+		}
+		rcu_read_unlock();
+
+		n = min(tasks_nr, CR_HDR_PIDS_CHUNK);
+		ret = cr_kwrite(ctx, hh, n * sizeof(*hh));
+		if (ret < 0)
+			break;
+
+		tasks_nr -= n;
+	} while (tasks_nr > 0);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh) * CR_HDR_PIDS_CHUNK);
+	return ret;
+}
+
+/* count number of tasks in tree (and optionally fill pid's in array) */
+static int cr_tree_count_tasks(struct cr_ctx *ctx)
+{
+	struct task_struct *root = ctx->root_task;
+	struct task_struct *task = root;
+	struct task_struct *parent = NULL;
+	struct task_struct **tasks_arr = ctx->tasks_arr;
+	int tasks_nr = ctx->tasks_nr;
+	int nr = 0;
+
+	read_lock(&tasklist_lock);
+
+	/* count tasks via DFS scan of the tree */
+	while (1) {
+		if (tasks_arr) {
+			/* unlikely... but if so then try again later */
+			if (nr == tasks_nr) {
+				nr = -EAGAIN;	/* cleanup in cr_ctx_free() */
+				break;
+			}
+			tasks_arr[nr] = task;
+			get_task_struct(task);
+		}
+
+		nr++;
+
+		/* if has children - proceed with child */
+		if (!list_empty(&task->children)) {
+			parent = task;
+			task = list_entry(task->children.next,
+					  struct task_struct, sibling);
+			continue;
+		}
+
+		while (task != root) {
+			/* if has sibling - proceed with sibling */
+			if (!list_is_last(&task->sibling, &parent->children)) {
+				task = list_entry(task->sibling.next,
+						  struct task_struct, sibling);
+				break;
+			}
+
+			/* else, trace back to parent and proceed */
+			task = parent;
+			parent = parent->real_parent;
+		}
+
+		if (task == root)
+			break;
+	}
+
+	read_unlock(&tasklist_lock);
+	return nr;
+}
+
+/*
+ * cr_build_tree - scan the tasks tree in DFS order and fill in array
+ * @ctx: checkpoint context
+ *
+ * Using DFS order simplifies the restart logic to re-create the tasks.
+ *
+ * On success, ctx->tasks_arr will be allocated and populated with all
+ * tasks (reference taken), and ctx->tasks_nr will hold the total count.
+ * The array is cleaned up by cr_ctx_free().
+ */
+static int cr_build_tree(struct cr_ctx *ctx)
+{
+	int n, m;
+
+	/* count tasks (no side effects) */
+	n = cr_tree_count_tasks(ctx);
+
+	ctx->tasks_nr = n;
+	ctx->tasks_arr = kzalloc(n * sizeof(*ctx->tasks_arr), GFP_KERNEL);
+	if (!ctx->tasks_arr)
+		return -ENOMEM;
+
+	/* count again (now will fill array) */
+	m = cr_tree_count_tasks(ctx);
+
+	/* unlikely, but ... (cleanup in cr_ctx_free) */
+	if (m < 0)
+		return m;
+	else if (m != n)
+		return -EBUSY;
+
+	return 0;
+}
+
+/* dump the array that describes the tasks tree */
+static int cr_write_tree(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tree *hh;
+	int ret;
+
+	h.type = CR_HDR_TREE;
+	h.len = sizeof(*hh);
+	h.parent = 0;
+
+	hh = cr_hbuf_get(ctx, sizeof(*hh));
+	hh->tasks_nr = ctx->tasks_nr;
+
+	ret = cr_write_obj(ctx, &h, hh);
+	cr_hbuf_put(ctx, sizeof(*hh));
+	if (ret < 0)
+		return ret;
+
+	ret = cr_write_pids(ctx);
+	return ret;
+}
+
 static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 {
 	struct task_struct *task = NULL;
@@ -278,7 +467,7 @@ static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 	if (!task)
 		goto out;
 
-#if 0	/* enable to use containers */
+#if 0	/* enable with containers */
 	if (!is_container_init(task)) {
 		err = -EINVAL;
 		goto out;
@@ -300,7 +489,7 @@ static int cr_get_container(struct cr_ctx *ctx, pid_t pid)
 	if (!nsproxy)
 		goto out;
 
-	/* TODO: verify that the container is frozen */
+	/* FIXME: verify that the container is frozen */
 
 	ctx->root_task = task;
 	ctx->root_nsproxy = nsproxy;
@@ -348,12 +537,22 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
 	ret = cr_ctx_checkpoint(ctx, pid);
 	if (ret < 0)
 		goto out;
+
+	ret = cr_build_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_write_head(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_write_task(ctx, ctx->root_task);
+	ret = cr_write_tree(ctx);
 	if (ret < 0)
 		goto out;
+
+	ret = cr_write_all_tasks(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_write_tail(ctx);
 	if (ret < 0)
 		goto out;
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 4a51ed3..0436ef3 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -152,6 +152,19 @@ void cr_hbuf_put(struct cr_ctx *ctx, int n)
  * restart operation, and persists until the operation is completed.
  */
 
+static void cr_task_arr_free(struct cr_ctx *ctx)
+{
+	int n;
+
+	for (n = 0; n < ctx->tasks_nr; n++) {
+		if (ctx->tasks_arr[n]) {
+			put_task_struct(ctx->tasks_arr[n]);
+			ctx->tasks_arr[n] = NULL;
+		}
+	}
+	kfree(ctx->tasks_arr);
+}
+
 static void cr_ctx_free(struct cr_ctx *ctx)
 {
 	if (ctx->file)
@@ -164,6 +177,9 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	cr_pgarr_free(ctx);
 	cr_objhash_free(ctx);
 
+	if (ctx->tasks_arr)
+		cr_task_arr_free(ctx);
+
 	if (ctx->root_nsproxy)
 		put_nsproxy(ctx->root_nsproxy);
 	if (ctx->root_task)
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index e867b95..86fcec9 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -34,6 +34,9 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
+	struct task_struct **tasks_arr;	/* array of all tasks in container */
+	int tasks_nr;			/* size of tasks array */
+
 	struct cr_objhash *objhash;	/* hash for shared objects */
 
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index cf6a637..6dc739f 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -45,7 +45,8 @@ enum {
 	CR_HDR_STRING,
 	CR_HDR_FNAME,
 
-	CR_HDR_TASK = 101,
+	CR_HDR_TREE = 101,
+	CR_HDR_TASK,
 	CR_HDR_THREAD,
 	CR_HDR_CPU,
 
@@ -81,6 +82,16 @@ struct cr_hdr_tail {
 	__u64 magic;
 } __attribute__((aligned(8)));
 
+struct cr_hdr_tree {
+	__u32 tasks_nr;
+} __attribute__((aligned(8)));
+
+struct cr_hdr_pids {
+	__s32 vpid;
+	__s32 vtgid;
+	__s32 vppid;
+} __attribute__((aligned(8)));
+
 struct cr_hdr_task {
 	__u32 state;
 	__u32 exit_state;
-- 
1.5.4.3
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [RFC v13][PATCH 14/14] Restart multiple processes
       [not found] ` <1233076092-8660-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2009-01-27 17:08   ` [RFC v13][PATCH 10/14] Restore open file descriprtors Oren Laadan
  2009-01-27 17:08   ` [RFC v13][PATCH 12/14] Track in-kernel when we expect checkpoint/restart to work Oren Laadan
@ 2009-01-27 17:08   ` Oren Laadan
  2 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-01-27 17:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Thomas Gleixner, Serge Hallyn, Dave Hansen, Ingo Molnar,
	H. Peter Anvin, Alexander Viro, Oren Laadan
Restarting of multiple processes expects all restarting tasks to call
sys_restart(). Once inside the system call, each task will restart
itself in the same order in which the tasks were saved. The internals of
the syscall will take care of in-kernel synchronization between tasks.
This patch does _not_ create the task tree in the kernel. Instead it
assumes that all tasks are created in some way and then invoke the
restart syscall. You can use the userspace mktree.c program to do
that.
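As an illustration of what such a userspace tool has to work out before
it can create the tasks, the sketch below takes an array shaped like
'struct cr_hdr_pids' (from patch 13/14) and finds, for each entry, the
index of its parent in the same array. This is not the code of mktree.c;
the names pids_entry and find_parent_index are made up for the example,
and int32_t stands in for __s32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for struct cr_hdr_pids (int32_t replaces __s32). */
struct pids_entry {
	int32_t vpid;
	int32_t vtgid;
	int32_t vppid;
};

/*
 * Hypothetical helper: return the index of the parent of entry 'i',
 * or -1 if the parent is not in the array (as for the root task).
 */
int find_parent_index(const struct pids_entry *arr, size_t nr, size_t i)
{
	size_t n;

	assert(i < nr);
	for (n = 0; n < nr; n++) {
		if (n != i && arr[n].vpid == arr[i].vppid)
			return (int)n;
	}
	return -1;
}
```

A tool could walk the array once, fork each child from the task found
at the returned index, and have every task call sys_restart() once the
tree is in place.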
The init task (*) has a special role: it allocates the restart context
(ctx), and coordinates the operation. In particular, it first waits
until all participating tasks enter the kernel, and provides them the
common restart context. Once everyone is ready, it begins to restart
itself.
In contrast, the other tasks enter the kernel, locate the init task (*)
and grab its restart context, and then wait for their turn to restore.
When a task (init or not) completes its restart, it hands the control
over to the next in line, by waking that task.
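The start-up rendezvous described above can be modelled in userspace
roughly as follows. This is only a sketch under simplifying assumptions:
C11 atomics replace the kernel's atomic_t, a plain flag replaces
'struct completion', and the names start_sync, start_sync_init and
start_sync_checkin are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified model of the fields used for the start-up rendezvous. */
struct start_sync {
	atomic_int tasks_count;	/* tasks that have not checked in yet */
	bool complete;		/* stands in for struct completion */
};

void start_sync_init(struct start_sync *s, int pids_nr)
{
	/* the root does not count itself, hence pids_nr - 1 */
	atomic_init(&s->tasks_count, pids_nr - 1);
	s->complete = false;
}

/* Called once by every non-root task after it has taken the context. */
void start_sync_checkin(struct start_sync *s)
{
	/* atomic_fetch_sub() returns the old value; 1 means we were last */
	if (atomic_fetch_sub(&s->tasks_count, 1) == 1)
		s->complete = true;
}
```

With three tasks in the image, the flag is raised by the second
check-in, which is the point at which the root would stop waiting and
begin its own restore.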
An array of pids (the one saved during the checkpoint) is used to
synchronize the operation. The first task in the array is the init
task (*). The restart context (ctx) maintains a "current position" in
the array, which indicates which task is currently active. Once the
currently active task completes its own restart, it increments that
position and wakes up the next task.
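The hand-off itself can be pictured with a small single-threaded model.
It leaves out sleeping and waking entirely and only tracks whose turn it
is; turn_ctx, turn_init, turn_is_mine and turn_next are hypothetical
names, and starting with the first array entry as the active one is a
simplification made for the example.

```c
#include <assert.h>
#include <stdbool.h>

struct turn_ctx {
	const int *pids_arr;	/* vpids in checkpoint order */
	int pids_nr;		/* number of entries in pids_arr */
	int pids_pos;		/* index of the active task */
	int pids_active;	/* vpid of the active task */
	bool complete;		/* set once the last task has finished */
};

void turn_init(struct turn_ctx *t, const int *pids, int nr)
{
	t->pids_arr = pids;
	t->pids_nr = nr;
	t->pids_pos = 0;
	t->pids_active = pids[0];
	t->complete = false;
}

/* May the task with this vpid run its restore now? */
bool turn_is_mine(const struct turn_ctx *t, int vpid)
{
	return !t->complete && t->pids_active == vpid;
}

/* Called by the active task once its own restore is done. */
void turn_next(struct turn_ctx *t)
{
	t->pids_pos++;
	if (t->pids_pos == t->pids_nr) {
		t->complete = true;	/* nobody left to wake */
		return;
	}
	t->pids_active = t->pids_arr[t->pids_pos];
}
```

In the patch, the condition checked by turn_is_mine() is what
cr_wait_task() sleeps on, and cr_next_task() plays the part of
turn_next() before waking the next task.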
Restart assumes that userspace provides meaningful data, otherwise
it's garbage-in-garbage-out. In this case, the syscall may block
indefinitely, but in TASK_INTERRUPTIBLE, so the user can ctrl-c or
otherwise kill the stray restarting tasks.
In terms of security, restart runs as the user that invokes it, so it
will not allow a user to do more than is otherwise permitted by the
usual system semantics and policy.
Currently we ignore threads and zombies, as well as session ids.
Add support for multiple processes
(*) For containers, restart should be called inside a fresh container
by the init task of that container. However, it is also possible to
restart applications not necessarily inside a container, and without
restoring the original pids of the processes (that is, provided that
the application can tolerate such behavior). This is useful to allow
multi-process restart of tasks not isolated inside a container, and
also for debugging.
Changelog[v13]:
  - Clear root_task->checkpoint_ctx regardless of error condition
  - Remove unused argument 'ctx' from do_restart_task() prototype
  - Remove unused member 'pids_err' from 'struct cr_ctx'
Changelog[v12]:
  - Replace obsolete cr_debug() with pr_debug()
Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
---
 checkpoint/restart.c       |  222 +++++++++++++++++++++++++++++++++++++++++++-
 checkpoint/sys.c           |   34 ++++++--
 include/linux/checkpoint.h |   22 ++++-
 include/linux/sched.h      |    1 +
 4 files changed, 265 insertions(+), 14 deletions(-)
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 0c46abf..7ec4de4 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -10,6 +10,7 @@
 
 #include <linux/version.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/file.h>
 #include <linux/magic.h>
 #include <linux/checkpoint.h>
@@ -276,30 +277,243 @@ static int cr_read_task(struct cr_ctx *ctx)
 	return ret;
 }
 
+/* cr_read_tree - read the tasks tree into the checkpoint context */
+static int cr_read_tree(struct cr_ctx *ctx)
+{
+	struct cr_hdr_tree *hh = cr_hbuf_get(ctx, sizeof(*hh));
+	int parent, size, ret = -EINVAL;
+
+	parent = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TREE);
+	if (parent < 0) {
+		ret = parent;
+		goto out;
+	} else if (parent != 0)
+		goto out;
+
+	if (hh->tasks_nr < 0)
+		goto out;
+
+	ctx->pids_nr = hh->tasks_nr;
+	size = sizeof(*ctx->pids_arr) * ctx->pids_nr;
+	if (size < 0)		/* overflow ? */
+		goto out;
+
+	ctx->pids_arr = kmalloc(size, GFP_KERNEL);
+	if (!ctx->pids_arr) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	ret = cr_kread(ctx, ctx->pids_arr, size);
+ out:
+	cr_hbuf_put(ctx, sizeof(*hh));
+	return ret;
+}
+
+static int cr_wait_task(struct cr_ctx *ctx)
+{
+	pid_t pid = task_pid_vnr(current);
+
+	pr_debug("pid %d waiting\n", pid);
+	return wait_event_interruptible(ctx->waitq, ctx->pids_active == pid);
+}
+
+static int cr_next_task(struct cr_ctx *ctx)
+{
+	struct task_struct *tsk;
+
+	ctx->pids_pos++;
+
+	pr_debug("pids_pos %d %d\n", ctx->pids_pos, ctx->pids_nr);
+	if (ctx->pids_pos == ctx->pids_nr) {
+		complete(&ctx->complete);
+		return 0;
+	}
+
+	ctx->pids_active = ctx->pids_arr[ctx->pids_pos].vpid;
+
+	pr_debug("pids_next %d\n", ctx->pids_active);
+
+	rcu_read_lock();
+	tsk = find_task_by_pid_ns(ctx->pids_active, ctx->root_nsproxy->pid_ns);
+	if (tsk)
+		wake_up_process(tsk);
+	rcu_read_unlock();
+
+	if (!tsk) {
+		complete(&ctx->complete);
+		return -ESRCH;
+	}
+
+	return 0;
+}
+
+/* FIXME: this should be per container */
+DECLARE_WAIT_QUEUE_HEAD(cr_restart_waitq);
+
+static int do_restart_task(pid_t pid)
+{
+	struct task_struct *root_task;
+	struct cr_ctx *ctx = NULL;
+	int ret;
+
+	rcu_read_lock();
+	root_task = find_task_by_pid_ns(pid, current->nsproxy->pid_ns);
+	if (root_task)
+		get_task_struct(root_task);
+	rcu_read_unlock();
+
+	if (!root_task)
+		return -EINVAL;
+
+	/*
+	 * wait for container init to initialize the restart context, then
+	 * grab a reference to that context, and if we're the last task to
+	 * do it, notify the container init.
+	 */
+	ret = wait_event_interruptible(cr_restart_waitq,
+				       root_task->checkpoint_ctx);
+	if (ret < 0)
+		goto out;
+
+	task_lock(root_task);
+	ctx = root_task->checkpoint_ctx;
+	if (ctx)
+		cr_ctx_get(ctx);
+	task_unlock(root_task);
+
+	if (!ctx) {
+		ret = -EAGAIN;
+		goto out;
+	}
+
+	if (atomic_dec_and_test(&ctx->tasks_count))
+		complete(&ctx->complete);
+
+	/* wait for our turn, do the restore, and tell next task in line */
+	ret = cr_wait_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_task(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_next_task(ctx);
+
+ out:
+	cr_ctx_put(ctx);
+	put_task_struct(root_task);
+	return ret;
+}
+
+/**
+ * cr_wait_all_tasks_start - wait for all tasks to enter sys_restart()
+ * @ctx: checkpoint context
+ *
+ * Called by the container root to wait until all restarting tasks
+ * are ready to restore their state. Temporarily advertises the 'ctx'
+ * on 'current->checkpoint_ctx' so that others can grab a reference
+ * to it, and clears it once synchronization completes. See also the
+ * related code in do_restart_task().
+ */
+static int cr_wait_all_tasks_start(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+	current->checkpoint_ctx = ctx;
+
+	wake_up_all(&cr_restart_waitq);
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+
+	task_lock(current);
+	current->checkpoint_ctx = NULL;
+	task_unlock(current);
+
+	return ret;
+}
+
+static int cr_wait_all_tasks_finish(struct cr_ctx *ctx)
+{
+	int ret;
+
+	if (ctx->pids_nr == 1)
+		return 0;
+
+	init_completion(&ctx->complete);
+
+	ret = cr_next_task(ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = wait_for_completion_interruptible(&ctx->complete);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
 /* setup restart-specific parts of ctx */
 static int cr_ctx_restart(struct cr_ctx *ctx, pid_t pid)
 {
+	ctx->root_pid = pid;
+	ctx->root_task = current;
+	ctx->root_nsproxy = current->nsproxy;
+
+	get_task_struct(ctx->root_task);
+	get_nsproxy(ctx->root_nsproxy);
+
+	atomic_set(&ctx->tasks_count, ctx->pids_nr - 1);
+
 	return 0;
 }
 
-int do_restart(struct cr_ctx *ctx, pid_t pid)
+static int do_restart_root(struct cr_ctx *ctx, pid_t pid)
 {
 	int ret;
 
+	ret = cr_read_head(ctx);
+	if (ret < 0)
+		goto out;
+	ret = cr_read_tree(ctx);
+	if (ret < 0)
+		goto out;
+
 	ret = cr_ctx_restart(ctx, pid);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_head(ctx);
+
+	/* wait for all other tasks to enter do_restart_task() */
+	ret = cr_wait_all_tasks_start(ctx);
 	if (ret < 0)
 		goto out;
+
 	ret = cr_read_task(ctx);
 	if (ret < 0)
 		goto out;
-	ret = cr_read_tail(ctx);
+
+	/* wait for all other tasks to complete do_restart_task() */
+	ret = cr_wait_all_tasks_finish(ctx);
 	if (ret < 0)
 		goto out;
 
-	/* on success, adjust the return value if needed [TODO] */
+	ret = cr_read_tail(ctx);
+
  out:
 	return ret;
 }
+
+int do_restart(struct cr_ctx *ctx, pid_t pid)
+{
+	int ret;
+
+	if (ctx)
+		ret = do_restart_root(ctx, pid);
+	else
+		ret = do_restart_task(pid);
+
+	/* on success, adjust the return value if needed [TODO] */
+	return ret;
+}
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index 0436ef3..f26b0c6 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -167,6 +167,8 @@ static void cr_task_arr_free(struct cr_ctx *ctx)
 
 static void cr_ctx_free(struct cr_ctx *ctx)
 {
+	BUG_ON(atomic_read(&ctx->refcount));
+
 	if (ctx->file)
 		fput(ctx->file);
 
@@ -185,6 +187,8 @@ static void cr_ctx_free(struct cr_ctx *ctx)
 	if (ctx->root_task)
 		put_task_struct(ctx->root_task);
 
+	kfree(ctx->pids_arr);
+
 	kfree(ctx);
 }
 
@@ -199,8 +203,10 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 
 	ctx->flags = flags;
 
+	atomic_set(&ctx->refcount, 0);
 	INIT_LIST_HEAD(&ctx->pgarr_list);
 	INIT_LIST_HEAD(&ctx->pgarr_pool);
+	init_waitqueue_head(&ctx->waitq);
 
 	err = -EBADF;
 	ctx->file = fget(fd);
@@ -215,6 +221,7 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	if (cr_objhash_alloc(ctx) < 0)
 		goto err;
 
+	atomic_inc(&ctx->refcount);
 	return ctx;
 
  err:
@@ -222,6 +229,17 @@ static struct cr_ctx *cr_ctx_alloc(int fd, unsigned long flags)
 	return ERR_PTR(err);
 }
 
+void cr_ctx_get(struct cr_ctx *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+void cr_ctx_put(struct cr_ctx *ctx)
+{
+	if (ctx && atomic_dec_and_test(&ctx->refcount))
+		cr_ctx_free(ctx);
+}
+
 /**
  * sys_checkpoint - checkpoint a container
  * @pid: pid of the container init(1) process
@@ -249,7 +267,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
 	if (!ret)
 		ret = ctx->crid;
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
 
@@ -264,7 +282,7 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
  */
 asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 {
-	struct cr_ctx *ctx;
+	struct cr_ctx *ctx = NULL;
 	pid_t pid;
 	int ret;
 
@@ -272,15 +290,17 @@ asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
 	if (flags)
 		return -EINVAL;
 
-	ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
-	if (IS_ERR(ctx))
-		return PTR_ERR(ctx);
-
 	/* FIXME: for now, we use 'crid' as a pid */
 	pid = (pid_t) crid;
 
+	if (pid == task_pid_vnr(current))
+		ctx = cr_ctx_alloc(fd, flags | CR_CTX_RSTR);
+
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+
 	ret = do_restart(ctx, pid);
 
-	cr_ctx_free(ctx);
+	cr_ctx_put(ctx);
 	return ret;
 }
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 86fcec9..217cf6e 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -13,10 +13,11 @@
 #include <linux/fs.h>
 #include <linux/path.h>
 #include <linux/sched.h>
+#include <asm/atomic.h>
 
 #ifdef CONFIG_CHECKPOINT_RESTART
 
-#define CR_VERSION  2
+#define CR_VERSION  3
 
 struct cr_ctx {
 	int crid;		/* unique checkpoint id */
@@ -34,8 +35,7 @@ struct cr_ctx {
 	void *hbuf;		/* temporary buffer for headers */
 	int hpos;		/* position in headers buffer */
 
-	struct task_struct **tasks_arr;	/* array of all tasks in container */
-	int tasks_nr;			/* size of tasks array */
+	atomic_t refcount;
 
 	struct cr_objhash *objhash;	/* hash for shared objects */
 
@@ -43,6 +43,19 @@ struct cr_ctx {
 	struct list_head pgarr_pool;	/* pool of empty page arrays chain */
 
 	struct path fs_mnt;	/* container root (FIXME) */
+
+	/* [multi-process checkpoint] */
+	struct task_struct **tasks_arr; /* array of all tasks [checkpoint] */
+	int tasks_nr;                   /* size of tasks array */
+
+	/* [multi-process restart] */
+	struct cr_hdr_pids *pids_arr;	/* array of all pids [restart] */
+	int pids_nr;			/* size of pids array */
+	int pids_pos;			/* position in pids array */
+	pid_t pids_active;		/* pid of (next) active task */
+	atomic_t tasks_count;		/* sync of tasks: used to coordinate */
+	struct completion complete;	/* container root and other tasks on */
+	wait_queue_head_t waitq;	/* start, end, and restart ordering */
 };
 
 /* cr_ctx: flags */
@@ -55,6 +68,9 @@ extern int cr_kread(struct cr_ctx *ctx, void *buf, int count);
 extern void *cr_hbuf_get(struct cr_ctx *ctx, int n);
 extern void cr_hbuf_put(struct cr_ctx *ctx, int n);
 
+extern void cr_ctx_get(struct cr_ctx *ctx);
+extern void cr_ctx_put(struct cr_ctx *ctx);
+
 /* shared objects handling */
 
 enum {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index faa2ec6..0150e90 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1359,6 +1359,7 @@ struct task_struct {
 
 #ifdef CONFIG_CHECKPOINT_RESTART
 	atomic_t may_checkpoint;
+	struct cr_ctx *checkpoint_ctx;
 #endif
 };
 
-- 
1.5.4.3
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Re: [RFC v13][PATCH 01/14] Create syscalls: sys_checkpoint, sys_restart
  2009-01-27 17:07 ` [RFC v13][PATCH 01/14] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2009-01-27 17:20   ` Randy Dunlap
  0 siblings, 0 replies; 121+ messages in thread
From: Randy Dunlap @ 2009-01-27 17:20 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Andrew Morton, Linus Torvalds, containers, linux-kernel, linux-mm,
	linux-api, Thomas Gleixner, Serge Hallyn, Dave Hansen,
	Ingo Molnar, H. Peter Anvin, Alexander Viro
Oren Laadan wrote:
> Changelog[v5]:
>   - Config is 'def_bool n' by default
That's true by default; it doesn't have to be written/typed.
> Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
> Acked-by: Serge Hallyn <serue@us.ibm.com>
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
>  arch/x86/include/asm/unistd_32.h   |    2 +
>  arch/x86/kernel/syscall_table_32.S |    2 +
>  checkpoint/Kconfig                 |   11 +++++++++
>  checkpoint/Makefile                |    5 ++++
>  checkpoint/sys.c                   |   41 ++++++++++++++++++++++++++++++++++++
>  include/linux/syscalls.h           |    2 +
>  init/Kconfig                       |    2 +
>  kernel/sys_ni.c                    |    4 +++
>  8 files changed, 69 insertions(+), 0 deletions(-)
>  create mode 100644 checkpoint/Kconfig
>  create mode 100644 checkpoint/Makefile
>  create mode 100644 checkpoint/sys.c
> 
> diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
> index f2bba78..a5f9e09 100644
> --- a/arch/x86/include/asm/unistd_32.h
> +++ b/arch/x86/include/asm/unistd_32.h
> @@ -338,6 +338,8 @@
>  #define __NR_dup3		330
>  #define __NR_pipe2		331
>  #define __NR_inotify_init1	332
> +#define __NR_checkpoint		333
> +#define __NR_restart		334
>  
>  #ifdef __KERNEL__
>  
> diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
> index d44395f..5543136 100644
> --- a/arch/x86/kernel/syscall_table_32.S
> +++ b/arch/x86/kernel/syscall_table_32.S
> @@ -332,3 +332,5 @@ ENTRY(sys_call_table)
>  	.long sys_dup3			/* 330 */
>  	.long sys_pipe2
>  	.long sys_inotify_init1
> +	.long sys_checkpoint
> +	.long sys_restart
> diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> new file mode 100644
> index 0000000..375129c
> --- /dev/null
> +++ b/checkpoint/sys.c
> @@ -0,0 +1,41 @@
> +/*
> + *  Generic container checkpoint-restart
> + *
> + *  Copyright (C) 2008 Oren Laadan
> + *
> + *  This file is subject to the terms and conditions of the GNU General Public
> + *  License.  See the file COPYING in the main directory of the Linux
> + *  distribution for more details.
> + */
> +
#include <linux/syscalls.h>
and then use the new syscall definition macros.
See SYSCALL_DEFINE* in kernel/*.c (current git tree) for examples.
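For readers without a kernel tree at hand, the shape of such a
definition can be shown with a userspace stand-in. The macro below is
NOT the kernel's SYSCALL_DEFINE3; it is a simplified imitation (named
SKETCH_SYSCALL_DEFINE3 to make that obvious) that only builds the
sys_<name>() prototype from (type, name) pairs.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/*
 * Stand-in for the kernel macro, for illustration only: it turns
 * (type, name) pairs into an ordinary sys_<name>() function header.
 */
#define SKETCH_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3) \
	long sys_##name(t1 a1, t2 a2, t3 a3)

SKETCH_SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
{
	/* arguments are unused in this stub */
	(void)pid;
	(void)fd;
	(void)flags;
	return -ENOSYS;
}
```

In the kernel the same definition would be spelled
SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags),
with the macro coming from <linux/syscalls.h>.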
> +#include <linux/sched.h>
> +#include <linux/kernel.h>
> +
> +/**
> + * sys_checkpoint - checkpoint a container
> + * @pid: pid of the container init(1) process
> + * @fd: file to which dump the checkpoint image
> + * @flags: checkpoint operation flags
> + *
> + * Returns positive identifier on success, 0 when returning from restart
> + * or negative value on error
> + */
> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags)
> +{
> +	pr_debug("sys_checkpoint not implemented yet\n");
> +	return -ENOSYS;
> +}
> +/**
> + * sys_restart - restart a container
> + * @crid: checkpoint image identifier
> + * @fd: file from which read the checkpoint image
> + * @flags: restart operation flags
> + *
> + * Returns negative value on error, or otherwise returns in the realm
> + * of the original checkpoint
> + */
> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags)
> +{
> +	pr_debug("sys_restart not implemented yet\n");
> +	return -ENOSYS;
> +}
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 04fb47b..9750393 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -621,6 +621,8 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
>  asmlinkage long sys_eventfd(unsigned int count);
>  asmlinkage long sys_eventfd2(unsigned int count, int flags);
>  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
> +asmlinkage long sys_restart(int crid, int fd, unsigned long flags);
>  
>  int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
>  
> diff --git a/init/Kconfig b/init/Kconfig
> index f763762..57364fe 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -814,6 +814,8 @@ config MARKERS
>  
>  source "arch/Kconfig"
>  
> +source "checkpoint/Kconfig"
> +
>  endmenu		# General setup
>  
>  config HAVE_GENERIC_DMA_COHERENT
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index e14a232..fcd65cc 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -174,3 +174,7 @@ cond_syscall(compat_sys_timerfd_settime);
>  cond_syscall(compat_sys_timerfd_gettime);
>  cond_syscall(sys_eventfd);
>  cond_syscall(sys_eventfd2);
> +
> +/* checkpoint/restart */
> +cond_syscall(sys_checkpoint);
> +cond_syscall(sys_restart);
-- 
~Randy
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
                   ` (11 preceding siblings ...)
       [not found] ` <1233076092-8660-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-02-10 17:05 ` Dave Hansen
  2009-02-11 22:14   ` Andrew Morton
  12 siblings, 1 reply; 121+ messages in thread
From: Dave Hansen @ 2009-02-10 17:05 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Andrew Morton, linux-api, containers, linux-kernel, linux-mm,
	Linus Torvalds, Alexander Viro, H. Peter Anvin, Thomas Gleixner,
	Ingo Molnar
On Tue, 2009-01-27 at 12:07 -0500, Oren Laadan wrote:
> Checkpoint-restart (c/r): a couple of fixes in preparation for 64bit
> architectures, and a couple of fixes for bugs (comments from Serge
> Hallyn, Sukadev Bhattiprolu and Nathan Lynch). Updated and tested
> against v2.6.28.
> 
> Aiming for -mm.
Is there anything that we're waiting on before these can go into -mm?  I
think the discussion on the first few patches has died down to almost
nothing.  They're pretty reviewed-out.  Do they need a run in -mm?  I
don't think linux-next is quite appropriate since they're not _quite_
aimed at mainline yet.
-- Dave
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-10 17:05 ` [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Dave Hansen
@ 2009-02-11 22:14   ` Andrew Morton
  2009-02-12  9:17     ` Ingo Molnar
  2009-02-12 18:11     ` Dave Hansen
  0 siblings, 2 replies; 121+ messages in thread
From: Andrew Morton @ 2009-02-11 22:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: orenl-eQaUEPhvms7ENvBUuze7eA, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	tglx-hfZtesqFncYOwBW4kG4KsQ, mingo-X9Un+BFzKDI
On Tue, 10 Feb 2009 09:05:47 -0800
Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> On Tue, 2009-01-27 at 12:07 -0500, Oren Laadan wrote:
> > Checkpoint-restart (c/r): a couple of fixes in preparation for 64bit
> > architectures, and a couple of fixes for bugs (comments from Serge
> > Hallyn, Sukadev Bhattiprolu and Nathan Lynch). Updated and tested
> > against v2.6.28.
> > 
> > Aiming for -mm.
> 
> Is there anything that we're waiting on before these can go into -mm?  I
> think the discussion on the first few patches has died down to almost
> nothing.  They're pretty reviewed-out.  Do they need a run in -mm?  I
> don't think linux-next is quite appropriate since they're not _quite_
> aimed at mainline yet.
> 
I raised an issue a few months ago and got inconclusively waffled at. 
Let us revisit.
I am concerned that this implementation is a bit of a toy, and that we
don't know what a sufficiently complete implementation will look like. 
There is a risk that if we merge the toy we either:
a) end up having to merge unacceptably-expensive-to-maintain code to
   make it a non-toy or
b) decide not to merge the unacceptably-expensive-to-maintain code,
   leaving us with a toy or
c) simply cannot work out how to implement the missing functionality.
So perhaps we can proceed by getting you guys to fill out the following
paperwork:
- In bullet-point form, what features are present?
- In bullet-point form, what features are missing, and should be added?
- Is it possible to briefly sketch out the design of the to-be-added
  features?
For extra marks:
- Will any of this involve non-trivial serialisation of kernel
  objects?  If so, that's getting into the
  unacceptably-expensive-to-maintain space, I suspect.
- Does (or will) this feature also support process migration?  If
  not, I'd have thought this to be a showstopper.
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-11 22:14   ` Andrew Morton
@ 2009-02-12  9:17     ` Ingo Molnar
       [not found]       ` <20090212091721.GB1888-X9Un+BFzKDI@public.gmane.org>
  2009-02-12 18:11     ` Dave Hansen
  1 sibling, 1 reply; 121+ messages in thread
From: Ingo Molnar @ 2009-02-12  9:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Hansen, orenl, linux-api, containers, linux-kernel, linux-mm,
	torvalds, viro, hpa, tglx
* Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 10 Feb 2009 09:05:47 -0800
> Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> 
> > On Tue, 2009-01-27 at 12:07 -0500, Oren Laadan wrote:
> > > Checkpoint-restart (c/r): a couple of fixes in preparation for 64bit
> > > architectures, and a couple of fixes for bugs (comments from Serge
> > > Hallyn, Sukadev Bhattiprolu and Nathan Lynch). Updated and tested
> > > against v2.6.28.
> > > 
> > > Aiming for -mm.
> > 
> > Is there anything that we're waiting on before these can go into -mm?  I
> > think the discussion on the first few patches has died down to almost
> > nothing.  They're pretty reviewed-out.  Do they need a run in -mm?  I
> > don't think linux-next is quite appropriate since they're not _quite_
> > aimed at mainline yet.
> > 
> 
> I raised an issue a few months ago and got inconclusively waffled at. 
> Let us revisit.
> 
> I am concerned that this implementation is a bit of a toy, and that we
> don't know what a sufficiently complete implementation will look like. 
> There is a risk that if we merge the toy we either:
> 
> a) end up having to merge unacceptably-expensive-to-maintain code to
>    make it a non-toy or
> 
> b) decide not to merge the unacceptably-expensive-to-maintain code,
>    leaving us with a toy or
> 
> c) simply cannot work out how to implement the missing functionality.
> 
> 
> So perhaps we can proceed by getting you guys to fill out the following
> paperwork:
> 
> - In bullet-point form, what features are present?
It would be nice to get an honest, critical-thinking answer on this.
What is it good for right now, and what are the known weaknesses and
quirks you can think of? Declaring them upfront is a bonus - not talking
about them and us discovering them later at the patch integration stage
is a sure recipe for upstream grumpiness.
This is an absolutely major feature, touching each and every subsystem in
a very fundamental way. It is also a cool capability worth a bit of a
maintenance pain, so we'd like to see the pros and cons nicely enumerated,
to the best of your knowledge. Most of us are just as feature-happy at
heart as you folks are, so if it can be done sanely we are on your side.
For example, one of the critical corner points: can an app programmatically 
determine whether it can support checkpoint/restart safely? Are there 
warnings/signals/helpers in place that make it a well-defined space, and
make the implementation of missing features directly actionable?
( instead of: 'silent breakage' and a wishy-washy boundary between the
  working and non-working space. Without clear boundaries there's no
  clear dynamics that extends the 'working' space beyond the demo stage. )
	Ingo
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-11 22:14   ` Andrew Morton
  2009-02-12  9:17     ` Ingo Molnar
@ 2009-02-12 18:11     ` Dave Hansen
  2009-02-12 19:30       ` Matt Mackall
  2009-02-13 23:28       ` Andrew Morton
  1 sibling, 2 replies; 121+ messages in thread
From: Dave Hansen @ 2009-02-12 18:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, orenl, linux-api, containers, linux-kernel,
	linux-mm, torvalds, viro, hpa, Thomas Gleixner
On Wed, 2009-02-11 at 14:14 -0800, Andrew Morton wrote:
> On Tue, 10 Feb 2009 09:05:47 -0800
> Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> 
> > On Tue, 2009-01-27 at 12:07 -0500, Oren Laadan wrote:
> > > Checkpoint-restart (c/r): a couple of fixes in preparation for 64bit
> > > architectures, and a couple of fixes for bugs (comments from Serge
> > > Hallyn, Sukadev Bhattiprolu and Nathan Lynch). Updated and tested
> > > against v2.6.28.
> > > 
> > > Aiming for -mm.
> > 
> > Is there anything that we're waiting on before these can go into -mm?  I
> > think the discussion on the first few patches has died down to almost
> > nothing.  They're pretty reviewed-out.  Do they need a run in -mm?  I
> > don't think linux-next is quite appropriate since they're not _quite_
> > aimed at mainline yet.
> > 
> 
> I raised an issue a few months ago and got inconclusively waffled at. 
> Let us revisit.
> 
> I am concerned that this implementation is a bit of a toy, and that we
> don't know what a sufficiently complete implementation will look like. 
> There is a risk that if we merge the toy we either:
> 
> a) end up having to merge unacceptably-expensive-to-maintain code to
>    make it a non-toy or
> 
> b) decide not to merge the unacceptably-expensive-to-maintain code,
>    leaving us with a toy or
> 
> c) simply cannot work out how to implement the missing functionality.
> 
> 
> So perhaps we can proceed by getting you guys to fill out the following
> paperwork:
> 
> - In bullet-point form, what features are present?
 * i386 arch is supported
 * a process can perform a "self-checkpoint", which means calling
   sys_checkpoint() on itself, as well as an "external checkpoint",
   where one task checkpoints another.
 * supported fds:
   * "normal" files on the filesystem
   * both endpoints of a pipe are checkpointed, as are pipe contents
 * each process's memory map is saved
 * the contents of anonymous memory are saved
 * infrastructure for managing objects in the checkpoint which are
   "shared" by multiple users like fds or a SysV semaphore, for instance
 * multiple processes may be checkpointed during a single sys_checkpoint()
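For the self-checkpoint case, the caller has to tell apart the three
outcomes documented for sys_checkpoint() in patch 01/14: a positive
identifier on success, 0 when execution resumes after a restart, and a
negative value on error. A minimal sketch of that decision, with the
enum and function names invented for the example:

```c
#include <assert.h>

enum ckpt_outcome {
	CKPT_FAILED,		/* negative return value */
	CKPT_RESUMED,		/* 0: we are running again after a restart */
	CKPT_TAKEN,		/* positive checkpoint identifier */
};

/* Classify the value returned by a self-checkpoint. */
enum ckpt_outcome ckpt_classify(long ret)
{
	if (ret < 0)
		return CKPT_FAILED;
	if (ret == 0)
		return CKPT_RESUMED;
	return CKPT_TAKEN;
}
```

This is the same pattern as fork() or setjmp(): one call site, and the
return value tells the caller which side of the operation it is on.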
> - In bullet-point form, what features are missing, and should be added?
 * support for more architectures than i386
 * file descriptors:
  * sockets (network, AF_UNIX, etc...)
  * device files
  * shmfs, hugetlbfs
  * epoll
  * unlinked files
 * Filesystem state
  * contents of files
  * mount tree for individual processes
 * flock
 * threads and sessions
 * CPU and NUMA affinity
 * sys_remap_file_pages()
This is a very minimal list that is surely incomplete and sure to grow.
I think of it like kernel scalability.  Is scalability important?  Do we
want the whole kernel to scale?  Yes, and yes, of course.  *Does* every
single device and feature in the kernel scale?  No way.  Will it ever be
"done"?  No freakin' way!  But, the kernel is scalable on the workloads
that are important to people.
Checkpoint/restart is the same way.  We intend to make core kernel
functionality checkpointable first.  We'll move outwards from there as
we (and our users) deem things important, but we'll certainly never be
done.  
> - Is it possible to briefly sketch out the design of the to-be-added
>   features?
For each architecture (and indeed each processor variant) we need to look
at how and when registers are saved on kernel entry, as well as things
like 32/64-bit processes and mm_context considerations.  x86_64, s390
and ppc work is ongoing.  Those ports have required quite small changes
in the generic code, which is a good sign.
Each fd type will need to be worked on separately.  Device files will
generally have to be one-off.  /dev/null has no internal state at all.
But, work needs to be done for devices which may have had all kinds of
ioctl()s done on them. 
Unlinked files will need their contents stored in the checkpoint so that
they may be copied over during restart (say to a temporary file),
opened, and unlinked again.  Files on kernel-internal mounts will need
similar treatment (think 'pipe_mnt').
We expect the filesystem *contents* to be taken care of generally by
outside mechanisms like dm or btrfs snapshotting.  
For the filesystem namespace, we'll effectively need to export what we
already have in /proc/$pid/mountinfo.  
I'm going to punt on explaining the networking bits for now because I
think I'd be wasting your time.  There are a couple of other guys around
much more versed in that area.
> For extra marks:
> 
> - Will any of this involve non-trivial serialisation of kernel
>   objects?  If so, that's getting into the
>   unacceptably-expensive-to-maintain space, I suspect.
We have some structures that are certainly tied to the kernel-internal
ones.  However, we are certainly *not* simply writing kernel structures
to userspace.  We could do that with /dev/mem.  We are carefully pulling
out the minimal bits of information from the kernel structures that we
*need* to recreate the function of the structure at restart.  There is a
maintenance burden here but, so far, that burden is almost entirely in
checkpoint/*.c.  We intend to test this functionality thoroughly to
ensure that we don't regress once we have integrated it.
> - Does (or will) this feature also support process migration?  If
>   not, I'd have thought this to be a showstopper.
You mean moving processes between machines?  Yes, it certainly will.
That is one of the primary design goals.
-- Dave
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
       [not found]       ` <20090212091721.GB1888-X9Un+BFzKDI@public.gmane.org>
@ 2009-02-12 18:11         ` Dave Hansen
  2009-02-12 20:48           ` Serge E. Hallyn
  2009-02-13 10:20           ` Ingo Molnar
  0 siblings, 2 replies; 121+ messages in thread
From: Dave Hansen @ 2009-02-12 18:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, orenl-eQaUEPhvms7ENvBUuze7eA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	tglx-hfZtesqFncYOwBW4kG4KsQ
On Thu, 2009-02-12 at 10:17 +0100, Ingo Molnar wrote:
> * Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> 
> > On Tue, 10 Feb 2009 09:05:47 -0800
> > Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> > 
> > > On Tue, 2009-01-27 at 12:07 -0500, Oren Laadan wrote:
> > > > Checkpoint-restart (c/r): a couple of fixes in preparation for 64bit
> > > > architectures, and a couple of fixes for bugss (comments from Serge
> > > > Hallyn, Sudakvev Bhattiprolu and Nathan Lynch). Updated and tested
> > > > against v2.6.28.
> > > > 
> > > > Aiming for -mm.
> > > 
> > > Is there anything that we're waiting on before these can go into -mm?  I
> > > think the discussion on the first few patches has died down to almost
> > > nothing.  They're pretty reviewed-out.  Do they need a run in -mm?  I
> > > don't think linux-next is quite appropriate since they're not _quite_
> > > aimed at mainline yet.
> > > 
> > 
> > I raised an issue a few months ago and got inconclusively waffled at. 
> > Let us revisit.
> > 
> > I am concerned that this implementation is a bit of a toy, and that we
> > don't know what a sufficiently complete implementation will look like. 
> > There is a risk that if we merge the toy we either:
> > 
> > a) end up having to merge unacceptably-expensive-to-maintain code to
> >    make it a non-toy or
> > 
> > b) decide not to merge the unacceptably-expensive-to-maintain code,
> >    leaving us with a toy or
> > 
> > c) simply cannot work out how to implement the missing functionality.
> > 
> > 
> > So perhaps we can proceed by getting you guys to fill out the following
> > paperwork:
> > 
> > - In bullet-point form, what features are present?
> 
> It would be nice to get an honest, critical-thinking answer on this.
> 
> What is it good for right now, and what are the known weaknesses and
> quirks you can think of. Declaring them upfront is a bonus - not talking
> about them and us discovering them later at the patch integration stage
> is a sure recipe for upstream grumpiness.
That's a fair enough point, and I do agree with you on it.
Right now, it is good for very little.  An app has to basically be
either specifically designed to work, or be pretty puny in its
capabilities.  An open fd can only be restored if a simple
open();lseek(); would have been sufficient to get it back into a good
state.  The process must be single-threaded.  Shared memory, hugetlbfs,
VM_NONLINEAR are not supported.  
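To make the "open();lseek();" limit concrete, here is a minimal userspace
sketch of what restoring such an fd amounts to.  struct saved_fd and
restore_simple_fd() are hypothetical names chosen for illustration; the
patches keep their own records in the checkpoint image.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical record of what a checkpoint would keep per simple fd. */
struct saved_fd {
	const char *path;	/* pathname the file was opened with */
	int flags;		/* open(2) flags, e.g. O_RDONLY */
	off_t pos;		/* file position at checkpoint time */
};

/* Reopen the file and seek back; returns the new fd or -1. */
int restore_simple_fd(const struct saved_fd *s)
{
	int fd;

	assert(s != NULL && s->path != NULL);

	/* Never create or truncate on restart: the file must already exist. */
	fd = open(s->path, s->flags & ~(O_CREAT | O_EXCL | O_TRUNC));
	if (fd < 0)
		return -1;

	if (lseek(fd, s->pos, SEEK_SET) == (off_t)-1) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Anything whose state is not captured by a path, flags and an offset
(sockets, epoll, devices with ioctl() state) falls outside this model,
which is why those show up in the missing-features list.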
> For example, one of the critical corner points: can an app programmatically 
> determine whether it can support checkpoint/restart safely? Are there 
> warnings/signals/helpers in place that make it a well-defined space, and
> make the implementation of missing features directly actionable?
> 
> ( instead of: 'silent breakage' and a wishy-washy boundary between the
>   working and non-working space. Without clear boundaries there's no
>   clear dynamics that extends the 'working' space beyond the demo stage. )
Patch 12/14 is supposed to address this *concept*.  But, it hasn't been
carried through so that it currently works.  My expectation was that we
would go through and add things over time.  I'll go make sure I push it
to the point that it actually works for at least the simple test
programs that we have.
What I will probably do is something BKL-style.  Basically put a "this
can't be checkpointed" marker over most everything I can think of and
selectively remove it as we add features.  
-- Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-12 18:11     ` Dave Hansen
@ 2009-02-12 19:30       ` Matt Mackall
  2009-02-12 19:42         ` Andrew Morton
  2009-02-12 22:57         ` [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Dave Hansen
  2009-02-13 23:28       ` Andrew Morton
  1 sibling, 2 replies; 121+ messages in thread
From: Matt Mackall @ 2009-02-12 19:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Andrew Morton, orenl, linux-api, containers,
	linux-kernel, linux-mm, torvalds, viro, hpa, Thomas Gleixner
On Thu, 2009-02-12 at 10:11 -0800, Dave Hansen wrote:
> > - In bullet-point form, what features are missing, and should be added?
> 
>  * support for more architectures than i386
>  * file descriptors:
>   * sockets (network, AF_UNIX, etc...)
>   * devices files
>   * shmfs, hugetlbfs
>   * epoll
>   * unlinked files
>  * Filesystem state
>   * contents of files
>   * mount tree for individual processes
>  * flock
>  * threads and sessions
>  * CPU and NUMA affinity
>  * sys_remap_file_pages()
I think the real question is: where are the dragons hiding? Some of
these are known to be hard. And some of them are critical to checkpointing
typical applications. If you have plans or theories for implementing all
of the above, then great. But this list doesn't really give any sense of
whether we should be scared of what lurks behind those doors.
Some of these things we probably don't have to care too much about. For
instance, contents of files - these can legitimately change for a
running process. Open TCP/IP sockets can legitimately get reset as well.
But others are a bigger deal.
Also, what happens if I checkpoint a process in 2.6.30 and restore it in
2.6.31 which has an expanded idea of what should be restored? Do your
file formats handle this sort of forward compatibility or am I
restricted to one kernel?
-- 
http://selenic.com : development and support for Mercurial and Linux
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-12 19:30       ` Matt Mackall
@ 2009-02-12 19:42         ` Andrew Morton
  2009-02-12 21:51           ` What can OpenVZ do? Dave Hansen
  2009-02-12 22:57         ` [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Dave Hansen
  1 sibling, 1 reply; 121+ messages in thread
From: Andrew Morton @ 2009-02-12 19:42 UTC (permalink / raw)
  To: Matt Mackall
  Cc: dave, mingo, orenl, linux-api, containers, linux-kernel, linux-mm,
	torvalds, viro, hpa, tglx
On Thu, 12 Feb 2009 13:30:35 -0600
Matt Mackall <mpm@selenic.com> wrote:
> On Thu, 2009-02-12 at 10:11 -0800, Dave Hansen wrote:
> 
> > > - In bullet-point form, what features are missing, and should be added?
> > 
> >  * support for more architectures than i386
> >  * file descriptors:
> >   * sockets (network, AF_UNIX, etc...)
> >   * devices files
> >   * shmfs, hugetlbfs
> >   * epoll
> >   * unlinked files
> 
> >  * Filesystem state
> >   * contents of files
> >   * mount tree for individual processes
> >  * flock
> >  * threads and sessions
> >  * CPU and NUMA affinity
> >  * sys_remap_file_pages()
> 
> I think the real questions is: where are the dragons hiding? Some of
> these are known to be hard. And some of them are critical checkpointing
> typical applications. If you have plans or theories for implementing all
> of the above, then great. But this list doesn't really give any sense of
> whether we should be scared of what lurks behind those doors.
How close has OpenVZ come to implementing all of this?  I think the
implementation is fairly complete?
If so, perhaps that can be used as a guide.  Will the planned feature
have a similar design?  If not, how will it differ?  To what extent can
we use that implementation as a tool for understanding what this new
implementation will look like?
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-12 18:11         ` Dave Hansen
@ 2009-02-12 20:48           ` Serge E. Hallyn
  2009-02-13 10:20           ` Ingo Molnar
  1 sibling, 0 replies; 121+ messages in thread
From: Serge E. Hallyn @ 2009-02-12 20:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ
Quoting Dave Hansen (dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> Patch 12/14 is supposed to address this *concept*.  But, it hasn't been
> carried through so that it currently works.  My expectation was that we
> would go through and add things over time.  I'll go make sure I push it
> to the point that it actually works for at least the simple test
> programs that we have.
> 
> What I will probably do is something BKL-style.  Basically put a "this
> can't be checkpointed" marker over most everything I can think of and
> selectively remove it as we add features.  
So the question is: when can we unset the uncheckpointable flag?
In your patch you suggest clone(CLONE_NEWPID).  But that would
require that we at that point do a slew of checks for other
things like open files of a type which are not supported.
I'm wondering whether we should instead stick to calculating
whether a task is checkpointable or not at checkpoint time.
To help an application figure out whether it can be checkpointed,
we can hook /proc/$$/checkpointable to the same function, and
have the file output list all of the reasons the task is not
checkpointable, e.g.:
	mmap MAP_SHARED file which is not yet supported
	open file from another mounts namespace
	open TCP socket which is not yet supported
	open epoll fd which is not yet supported
	TASK NOT FROZEN
So now every time we do a checkpoint we have to do all these
checks, but that's better than at clone time.
You suggested on irc having a fops->is_checkpointable()
fn, which is imo a good idea to help implement the above.
The default value can be a fn returning false.  I suppose
we want to pass back a char* with the file type as well.
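A rough userspace model of the idea, to show the shape of it.  Everything
here is a simplified stand-in under assumptions: struct file and
struct file_ops below are not the kernel's structures, and the hook
signature (returning a flag and handing back the file type through
'reason') is just one way the char* could be passed back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct file;
struct file_ops {
	/* Returns 1 if checkpointable; otherwise 0 and sets *reason. */
	int (*is_checkpointable)(const struct file *f, const char **reason);
};
struct file {
	const char *type;		/* e.g. "regular file", "epoll fd" */
	const struct file_ops *fops;
};

/* Default: anything without an explicit hook is not checkpointable. */
static int default_is_checkpointable(const struct file *f, const char **reason)
{
	*reason = f->type;
	return 0;
}

static int regular_is_checkpointable(const struct file *f, const char **reason)
{
	(void)f;
	*reason = NULL;
	return 1;
}

const struct file_ops regular_fops = { regular_is_checkpointable };
const struct file_ops epoll_fops = { NULL };	/* no hook: uses the default */

/*
 * Walk a task's files at checkpoint time and print one line per obstacle,
 * in the spirit of the /proc/$$/checkpointable listing.  Returns the
 * number of obstacles found.
 */
int report_uncheckpointable(const struct file *files, size_t n, FILE *out)
{
	int blockers = 0;

	for (size_t i = 0; i < n; i++) {
		const char *reason = NULL;
		int (*hook)(const struct file *, const char **);

		assert(files[i].fops != NULL);
		hook = files[i].fops->is_checkpointable;
		if (!hook)
			hook = default_is_checkpointable;
		if (!hook(&files[i], &reason)) {
			fprintf(out, "open %s which is not yet supported\n",
				reason);
			blockers++;
		}
	}
	return blockers;
}
```

Because the same function would back both the checkpoint path and the
/proc file, the two cannot disagree about what is supported.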
-serge
^ permalink raw reply	[flat|nested] 121+ messages in thread
* What can OpenVZ do?
  2009-02-12 19:42         ` Andrew Morton
@ 2009-02-12 21:51           ` Dave Hansen
  2009-02-12 22:10             ` Andrew Morton
                               ` (2 more replies)
  0 siblings, 3 replies; 121+ messages in thread
From: Dave Hansen @ 2009-02-12 21:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matt Mackall, containers, hpa, linux-kernel, linux-mm, viro,
	linux-api, mingo, torvalds, tglx, Pavel Emelyanov
On Thu, 2009-02-12 at 11:42 -0800, Andrew Morton wrote:
> On Thu, 12 Feb 2009 13:30:35 -0600
> Matt Mackall <mpm@selenic.com> wrote:
> 
> > On Thu, 2009-02-12 at 10:11 -0800, Dave Hansen wrote:
> > 
> > > > - In bullet-point form, what features are missing, and should be added?
> > > 
> > >  * support for more architectures than i386
> > >  * file descriptors:
> > >   * sockets (network, AF_UNIX, etc...)
> > >   * devices files
> > >   * shmfs, hugetlbfs
> > >   * epoll
> > >   * unlinked files
> > 
> > >  * Filesystem state
> > >   * contents of files
> > >   * mount tree for individual processes
> > >  * flock
> > >  * threads and sessions
> > >  * CPU and NUMA affinity
> > >  * sys_remap_file_pages()
> > 
> > I think the real questions is: where are the dragons hiding? Some of
> > these are known to be hard. And some of them are critical checkpointing
> > typical applications. If you have plans or theories for implementing all
> > of the above, then great. But this list doesn't really give any sense of
> > whether we should be scared of what lurks behind those doors.
> 
> How close has OpenVZ come to implementing all of this?  I think the
> implementatation is fairly complete?
I also believe it is "fairly complete".  At least able to be used
practically.
> If so, perhaps that can be used as a guide.  Will the planned feature
> have a similar design?  If not, how will it differ?  To what extent can
> we use that implementation as a tool for understanding what this new
> implementation will look like?
Yes, we can certainly use it as a guide.  However, there are some
barriers to being able to do that:
dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
 628 files changed, 59597 insertions(+), 2927 deletions(-)
dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
  84887  290855 2308745
Unfortunately, the git tree doesn't have that great of a history.  It
appears that the forward-ports are just applications of huge single
patches which then get committed into git.  This tree has also
historically contained a bunch of stuff not directly related to
checkpoint/restart like resource management.
We'd be idiots not to take a hard look at what has been done in OpenVZ.
But, for the time being, we have absolutely no shortage of things that
we know are important and know have to be done.  Our largest problem is
not finding things to do, but is our large out-of-tree patch that is
growing by the day. :(
-- Dave
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-12 21:51           ` What can OpenVZ do? Dave Hansen
@ 2009-02-12 22:10             ` Andrew Morton
  2009-02-12 23:04               ` How much of a mess does OpenVZ make? ;) Was: " Dave Hansen
  2009-02-13 10:53               ` Ingo Molnar
  2009-02-12 22:17             ` What can OpenVZ do? Alexey Dobriyan
  2009-02-13 10:27             ` Ingo Molnar
  2 siblings, 2 replies; 121+ messages in thread
From: Andrew Morton @ 2009-02-12 22:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: mpm, containers, hpa, linux-kernel, linux-mm, viro, linux-api,
	mingo, torvalds, tglx, xemul
On Thu, 12 Feb 2009 13:51:23 -0800
Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> On Thu, 2009-02-12 at 11:42 -0800, Andrew Morton wrote:
> > On Thu, 12 Feb 2009 13:30:35 -0600
> > Matt Mackall <mpm@selenic.com> wrote:
> > 
> > > On Thu, 2009-02-12 at 10:11 -0800, Dave Hansen wrote:
> > > 
> > > > > - In bullet-point form, what features are missing, and should be added?
> > > > 
> > > >  * support for more architectures than i386
> > > >  * file descriptors:
> > > >   * sockets (network, AF_UNIX, etc...)
> > > >   * devices files
> > > >   * shmfs, hugetlbfs
> > > >   * epoll
> > > >   * unlinked files
> > > 
> > > >  * Filesystem state
> > > >   * contents of files
> > > >   * mount tree for individual processes
> > > >  * flock
> > > >  * threads and sessions
> > > >  * CPU and NUMA affinity
> > > >  * sys_remap_file_pages()
> > > 
> > > I think the real questions is: where are the dragons hiding? Some of
> > > these are known to be hard. And some of them are critical checkpointing
> > > typical applications. If you have plans or theories for implementing all
> > > of the above, then great. But this list doesn't really give any sense of
> > > whether we should be scared of what lurks behind those doors.
> > 
> > How close has OpenVZ come to implementing all of this?  I think the
> > implementatation is fairly complete?
> 
> I also believe it is "fairly complete".  At least able to be used
> practically.
> 
> > If so, perhaps that can be used as a guide.  Will the planned feature
> > have a similar design?  If not, how will it differ?  To what extent can
> > we use that implementation as a tool for understanding what this new
> > implementation will look like?
> 
> Yes, we can certainly use it as a guide.  However, there are some
> barriers to being able to do that:
> 
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
>  628 files changed, 59597 insertions(+), 2927 deletions(-)
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
>   84887  290855 2308745
> 
> Unfortunately, the git tree doesn't have that great of a history.  It
> appears that the forward-ports are just applications of huge single
> patches which then get committed into git.  This tree has also
> historically contained a bunch of stuff not directly related to
> checkpoint/restart like resource management.
> 
> We'd be idiots not to take a hard look at what has been done in OpenVZ.
> But, for the time being, we have absolutely no shortage of things that
> we know are important and know have to be done.  Our largest problem is
> not finding things to do, but is our large out-of-tree patch that is
> growing by the day. :(
> 
Well we have a chicken-and-eggish thing.  The patchset will keep
growing until we understand how much of this:
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
>  628 files changed, 59597 insertions(+), 2927 deletions(-)
we will be committed to if we were to merge the current patchset.
Now, we've gone in blind before - most notably on the
containers/cgroups/namespaces stuff.  That hail mary pass worked out
acceptably, I think.  Maybe we got lucky.  I thought that
net-namespaces in particular would never get there, but it did.
That was a very large and quite long-term-important user-visible
feature.
checkpoint/restart/migration is also a long-term-...-feature.  But if
at all possible I do think that we should go into it with our eyes a
little less shut.
Interestingly, there was also prior-art for
containers/cgroups/namespaces within OpenVZ.  But we decided up-front
(I think) that the eventual implementation would have little in common
with preceding implementations.
Oh, and I'd disagree with your new Subject:.  It's pretty easy to find
out what OpenVZ can do.  The more important question here is "how much
of a mess did it make when it did it?"
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-12 21:51           ` What can OpenVZ do? Dave Hansen
  2009-02-12 22:10             ` Andrew Morton
@ 2009-02-12 22:17             ` Alexey Dobriyan
  2009-02-13 10:27             ` Ingo Molnar
  2 siblings, 0 replies; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-12 22:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Matt Mackall, containers, hpa, linux-kernel,
	linux-mm, viro, linux-api, mingo, torvalds, tglx, Pavel Emelyanov
On Thu, Feb 12, 2009 at 01:51:23PM -0800, Dave Hansen wrote:
> On Thu, 2009-02-12 at 11:42 -0800, Andrew Morton wrote:
> > On Thu, 12 Feb 2009 13:30:35 -0600
> > Matt Mackall <mpm@selenic.com> wrote:
> > 
> > > On Thu, 2009-02-12 at 10:11 -0800, Dave Hansen wrote:
> > > 
> > > > > - In bullet-point form, what features are missing, and should be added?
> > > > 
> > > >  * support for more architectures than i386
> > > >  * file descriptors:
> > > >   * sockets (network, AF_UNIX, etc...)
> > > >   * devices files
> > > >   * shmfs, hugetlbfs
> > > >   * epoll
> > > >   * unlinked files
> > > 
> > > >  * Filesystem state
> > > >   * contents of files
> > > >   * mount tree for individual processes
> > > >  * flock
> > > >  * threads and sessions
> > > >  * CPU and NUMA affinity
> > > >  * sys_remap_file_pages()
> > > 
> > > I think the real questions is: where are the dragons hiding? Some of
> > > these are known to be hard. And some of them are critical checkpointing
> > > typical applications. If you have plans or theories for implementing all
> > > of the above, then great. But this list doesn't really give any sense of
> > > whether we should be scared of what lurks behind those doors.
> > 
> > How close has OpenVZ come to implementing all of this?  I think the
> > implementatation is fairly complete?
> 
> I also believe it is "fairly complete".  At least able to be used
> practically.
> 
> > If so, perhaps that can be used as a guide.  Will the planned feature
> > have a similar design?  If not, how will it differ?  To what extent can
> > we use that implementation as a tool for understanding what this new
> > implementation will look like?
> 
> Yes, we can certainly use it as a guide.  However, there are some
> barriers to being able to do that:
> 
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
>  628 files changed, 59597 insertions(+), 2927 deletions(-)
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
>   84887  290855 2308745
	git-diff -- kernel/cpt/
should give a more realistic picture.
> Unfortunately, the git tree doesn't have that great of a history.  It
> appears that the forward-ports are just applications of huge single
> patches which then get committed into git.  This tree has also
> historically contained a bunch of stuff not directly related to
> checkpoint/restart like resource management.
> We'd be idiots not to take a hard look at what has been done in OpenVZ.
> But, for the time being, we have absolutely no shortage of things that
> we know are important and know have to be done.  Our largest problem is
> not finding things to do, but is our large out-of-tree patch that is
> growing by the day. :(
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-12 19:30       ` Matt Mackall
  2009-02-12 19:42         ` Andrew Morton
@ 2009-02-12 22:57         ` Dave Hansen
  2009-02-12 23:05           ` Matt Mackall
  1 sibling, 1 reply; 121+ messages in thread
From: Dave Hansen @ 2009-02-12 22:57 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Ingo Molnar, Andrew Morton, orenl, linux-api, containers,
	linux-kernel, linux-mm, torvalds, viro, hpa, Thomas Gleixner,
	Cedric Le Goater, Pavel Emelyanov, Alexey Dobriyan
On Thu, 2009-02-12 at 13:30 -0600, Matt Mackall wrote:
> On Thu, 2009-02-12 at 10:11 -0800, Dave Hansen wrote:
...
> >  * Filesystem state
> >   * contents of files
> >   * mount tree for individual processes
> >  * flock
> >  * threads and sessions
> >  * CPU and NUMA affinity
> >  * sys_remap_file_pages()
> 
> I think the real questions is: where are the dragons hiding? Some of
> these are known to be hard. And some of them are critical checkpointing
> typical applications. If you have plans or theories for implementing all
> of the above, then great. But this list doesn't really give any sense of
> whether we should be scared of what lurks behind those doors.
This is probably a better question for people like Pavel, Alexey and
Cedric to answer.  
> Some of these things we probably don't have to care too much about. For
> instance, contents of files - these can legitimately change for a
> running process. Open TCP/IP sockets can legitimately get reset as well.
> But others are a bigger deal.
Legitimately, yes.  But, practically, these are things that we need to
handle because we want to make any checkpoint/restart as transparent as
possible.  Resetting people's network connections is not exactly illegal
but not very nice or transparent either.
> Also, what happens if I checkpoint a process in 2.6.30 and restore it in
> 2.6.31 which has an expanded idea of what should be restored? Do your
> file formats handle this sort of forward compatibility or am I
> restricted to one kernel?
In general, you're restricted to one kernel.  But, people have mentioned
that, if the formats change, we should be able to write in-userspace
converters for the checkpoint files.  
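As a sketch of what such a converter might do for one record type:
the header layouts, the magic value and convert_hdr_v1_to_v2() below are
entirely hypothetical and are not the image format used by these
patches.  The point is only that a versioned header lets a userspace tool
rewrite an old image for a newer kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Entirely hypothetical image headers, for illustration only. */
#define CR_MAGIC 0x43523031u

struct hdr_v1 {
	uint32_t magic;
	uint32_t version;
	uint32_t ntasks;
};

struct hdr_v2 {
	uint32_t magic;
	uint32_t version;
	uint32_t ntasks;
	uint32_t flags;		/* field that only exists in v2 */
};

/* Convert a v1 header to v2; returns 0 on success, -1 if input is not v1. */
int convert_hdr_v1_to_v2(const struct hdr_v1 *in, struct hdr_v2 *out)
{
	assert(in != NULL && out != NULL);

	if (in->magic != CR_MAGIC || in->version != 1)
		return -1;

	memset(out, 0, sizeof(*out));
	out->magic = in->magic;
	out->version = 2;
	out->ntasks = in->ntasks;
	out->flags = 0;		/* new field: fill with a neutral default */
	return 0;
}
```

This only covers the easy direction, where the newer format is a superset
of the older one; state that the old kernel never saved cannot be
invented by a converter.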
-- Dave
^ permalink raw reply	[flat|nested] 121+ messages in thread
* How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-12 22:10             ` Andrew Morton
@ 2009-02-12 23:04               ` Dave Hansen
  2009-02-26 15:57                 ` Alexey Dobriyan
  2009-02-26 16:27                 ` Alexey Dobriyan
  2009-02-13 10:53               ` Ingo Molnar
  1 sibling, 2 replies; 121+ messages in thread
From: Dave Hansen @ 2009-02-12 23:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mpm, containers, hpa, linux-kernel, linux-mm, viro, linux-api,
	mingo, torvalds, tglx, xemul, Alexey Dobriyan
On Thu, 2009-02-12 at 14:10 -0800, Andrew Morton wrote:
> On Thu, 12 Feb 2009 13:51:23 -0800
> Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> 
> > On Thu, 2009-02-12 at 11:42 -0800, Andrew Morton wrote:
> > > On Thu, 12 Feb 2009 13:30:35 -0600
> > > Matt Mackall <mpm@selenic.com> wrote:
> > > 
> > > > On Thu, 2009-02-12 at 10:11 -0800, Dave Hansen wrote:
> > > > 
> > > > > > - In bullet-point form, what features are missing, and should be added?
> > > > > 
> > > > >  * support for more architectures than i386
> > > > >  * file descriptors:
> > > > >   * sockets (network, AF_UNIX, etc...)
> > > > >   * devices files
> > > > >   * shmfs, hugetlbfs
> > > > >   * epoll
> > > > >   * unlinked files
> > > > 
> > > > >  * Filesystem state
> > > > >   * contents of files
> > > > >   * mount tree for individual processes
> > > > >  * flock
> > > > >  * threads and sessions
> > > > >  * CPU and NUMA affinity
> > > > >  * sys_remap_file_pages()
> > > > 
> > > > I think the real questions is: where are the dragons hiding? Some of
> > > > these are known to be hard. And some of them are critical checkpointing
> > > > typical applications. If you have plans or theories for implementing all
> > > > of the above, then great. But this list doesn't really give any sense of
> > > > whether we should be scared of what lurks behind those doors.
> > > 
> > > How close has OpenVZ come to implementing all of this?  I think the
> > > implementatation is fairly complete?
> > 
> > I also believe it is "fairly complete".  At least able to be used
> > practically.
> > 
> > > If so, perhaps that can be used as a guide.  Will the planned feature
> > > have a similar design?  If not, how will it differ?  To what extent can
> > > we use that implementation as a tool for understanding what this new
> > > implementation will look like?
> > 
> > Yes, we can certainly use it as a guide.  However, there are some
> > barriers to being able to do that:
> > 
> > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
> >  628 files changed, 59597 insertions(+), 2927 deletions(-)
> > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
> >   84887  290855 2308745
> > 
> > Unfortunately, the git tree doesn't have that great of a history.  It
> > appears that the forward-ports are just applications of huge single
> > patches which then get committed into git.  This tree has also
> > historically contained a bunch of stuff not directly related to
> > checkpoint/restart like resource management.
> > 
> > We'd be idiots not to take a hard look at what has been done in OpenVZ.
> > But, for the time being, we have absolutely no shortage of things that
> > we know are important and know have to be done.  Our largest problem is
> > not finding things to do, but is our large out-of-tree patch that is
> > growing by the day. :(
> > 
> 
> Well we have a chicken-and-eggish thing.  The patchset will keep
> growing until we understand how much of this:
> 
> > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
> >  628 files changed, 59597 insertions(+), 2927 deletions(-)
> 
> we will be committed to if we were to merge the current patchset.
Here's the measurement that Alexey suggested:
dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... kernel/cpt/ | diffstat 
 Makefile        |   53 +
 cpt_conntrack.c |  365 ++++++++++++
 cpt_context.c   |  257 ++++++++
 cpt_context.h   |  215 +++++++
 cpt_dump.c      | 1250 ++++++++++++++++++++++++++++++++++++++++++
 cpt_dump.h      |   16 
 cpt_epoll.c     |  113 +++
 cpt_exports.c   |   13 
 cpt_files.c     | 1626 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 cpt_files.h     |   71 ++
 cpt_fsmagic.h   |   16 
 cpt_inotify.c   |  144 ++++
 cpt_kernel.c    |  177 ++++++
 cpt_kernel.h    |   99 +++
 cpt_mm.c        |  923 +++++++++++++++++++++++++++++++
 cpt_mm.h        |   35 +
 cpt_net.c       |  614 ++++++++++++++++++++
 cpt_net.h       |    7 
 cpt_obj.c       |  162 +++++
 cpt_obj.h       |   62 ++
 cpt_proc.c      |  595 ++++++++++++++++++++
 cpt_process.c   | 1369 ++++++++++++++++++++++++++++++++++++++++++++++
 cpt_process.h   |   13 
 cpt_socket.c    |  790 ++++++++++++++++++++++++++
 cpt_socket.h    |   33 +
 cpt_socket_in.c |  450 +++++++++++++++
 cpt_syscalls.h  |  101 +++
 cpt_sysvipc.c   |  403 +++++++++++++
 cpt_tty.c       |  215 +++++++
 cpt_ubc.c       |  132 ++++
 cpt_ubc.h       |   23 
 cpt_x8664.S     |   67 ++
 rst_conntrack.c |  283 +++++++++
 rst_context.c   |  323 ++++++++++
 rst_epoll.c     |  169 +++++
 rst_files.c     | 1648 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 rst_inotify.c   |  196 ++++++
 rst_mm.c        | 1151 +++++++++++++++++++++++++++++++++++++++
 rst_net.c       |  741 +++++++++++++++++++++++++
 rst_proc.c      |  580 +++++++++++++++++++
 rst_process.c   | 1640 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 rst_socket.c    |  918 +++++++++++++++++++++++++++++++
 rst_socket_in.c |  489 ++++++++++++++++
 rst_sysvipc.c   |  633 +++++++++++++++++++++
 rst_tty.c       |  384 +++++++++++++
 rst_ubc.c       |  131 ++++
 rst_undump.c    | 1007 ++++++++++++++++++++++++++++++++++
 47 files changed, 20702 insertions(+)
One important thing that leaves out is the interaction that this code
has with the rest of the kernel.  That's critically important when
considering long-term maintenance, and I'd be curious how the OpenVZ
folks view it. 
> Now, we've gone in blind before - most notably on the
> containers/cgroups/namespaces stuff.  That hail mary pass worked out
> acceptably, I think.  Maybe we got lucky.  I thought that
> net-namespaces in particular would never get there, but it did.
> 
> That was a very large and quite long-term-important user-visible
> feature.
> 
> checkpoint/restart/migration is also a long-term-...-feature.  But if
> at all possible I do think that we should go into it with our eyes a
> little less shut.
One thing Ingo has asked for that I understand a bit more clearly is a
programmatic statement of what is and is not covered by this current
code.  That's certainly one eye-opening activity which I'll get to
immediately.
-- Dave
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-12 22:57         ` [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Dave Hansen
@ 2009-02-12 23:05           ` Matt Mackall
  2009-02-12 23:13             ` Dave Hansen
  0 siblings, 1 reply; 121+ messages in thread
From: Matt Mackall @ 2009-02-12 23:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Andrew Morton, orenl, linux-api, containers,
	linux-kernel, linux-mm, torvalds, viro, hpa, Thomas Gleixner,
	Cedric Le Goater, Pavel Emelyanov, Alexey Dobriyan
On Thu, 2009-02-12 at 14:57 -0800, Dave Hansen wrote:
> > Also, what happens if I checkpoint a process in 2.6.30 and restore it in
> > 2.6.31 which has an expanded idea of what should be restored? Do your
> > file formats handle this sort of forward compatibility or am I
> > restricted to one kernel?
> 
> In general, you're restricted to one kernel.  But, people have mentioned
> that, if the formats change, we should be able to write in-userspace
> converters for the checkpoint files.  
I mentioned this because it seems like a key use case is upgrading
kernels out from under long-lived applications.
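
A minimal sketch of the compatibility decision discussed above, assuming
the image begins with a small header that carries a magic number and a
format revision.  The header layout, the magic value and every name here
are hypothetical and are not taken from the patch set; the sketch only
shows the three outcomes mentioned in the thread: restart directly, hand
the image to a userspace converter, or refuse it.

```c
#include <stdint.h>

/* Hypothetical magic number and header layout, for illustration only. */
#define CKPT_MODEL_MAGIC 0x434b5054u

struct ckpt_hdr_model {
	uint32_t magic;    /* identifies the file as a checkpoint image */
	uint32_t version;  /* image format revision written at checkpoint */
};

enum ckpt_compat {
	CKPT_COMPAT_OK,       /* formats match: restart directly */
	CKPT_COMPAT_CONVERT,  /* older image: run a userspace converter first */
	CKPT_COMPAT_REJECT,   /* not an image, or written in a newer format */
};

/* Decide what to do with an image, given the format this kernel reads. */
enum ckpt_compat ckpt_check_hdr(const struct ckpt_hdr_model *hdr,
				uint32_t supported_version)
{
	if (hdr->magic != CKPT_MODEL_MAGIC)
		return CKPT_COMPAT_REJECT;
	if (hdr->version == supported_version)
		return CKPT_COMPAT_OK;
	if (hdr->version < supported_version)
		return CKPT_COMPAT_CONVERT;
	return CKPT_COMPAT_REJECT;
}
```

Refusing newer images outright is one possible policy; a converter that
downgrades images would need a fourth outcome.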
-- 
http://selenic.com : development and support for Mercurial and Linux
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-12 23:05           ` Matt Mackall
@ 2009-02-12 23:13             ` Dave Hansen
  0 siblings, 0 replies; 121+ messages in thread
From: Dave Hansen @ 2009-02-12 23:13 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Ingo Molnar, Andrew Morton, orenl, linux-api, containers,
	linux-kernel, linux-mm, torvalds, viro, hpa, Thomas Gleixner,
	Cedric Le Goater, Pavel Emelyanov, Alexey Dobriyan
On Thu, 2009-02-12 at 17:05 -0600, Matt Mackall wrote:
> On Thu, 2009-02-12 at 14:57 -0800, Dave Hansen wrote:
> > > Also, what happens if I checkpoint a process in 2.6.30 and restore it in
> > > 2.6.31 which has an expanded idea of what should be restored? Do your
> > > file formats handle this sort of forward compatibility or am I
> > > restricted to one kernel?
> > 
> > In general, you're restricted to one kernel.  But, people have mentioned
> > that, if the formats change, we should be able to write in-userspace
> > converters for the checkpoint files.  
> 
> I mentioned this because it seems like a key use case is upgrading
> kernels out from under long-lived applications.
The key users as I envision them aren't really kernel hackers who are
always running 2.6-next and running radically different kernels from
moment to moment. :)
Distros are pretty picky about changing things internal to the kernel
during errata updates or even service packs.  While that can be a pain
for some of us developers trying to get features and fixes in, it is a
godsend for trying to do something like process migration across an
update.
My random speculation would be that if a kernel upgrade can be
performed with ksplice (http://www.ksplice.com/) -- the original
non-fancy version at least -- we can probably migrate across the
upgrade.
-- Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-12 18:11         ` Dave Hansen
  2009-02-12 20:48           ` Serge E. Hallyn
@ 2009-02-13 10:20           ` Ingo Molnar
  1 sibling, 0 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-02-13 10:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, orenl, linux-api, containers, linux-kernel,
	linux-mm, torvalds, viro, hpa, tglx
* Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> > What is it good for right now, and what are the known weaknesses and
> > quirks you can think of. Declaring them upfront is a bonus - not talking
> > about them and us discovering them later at the patch integration stage
> > is a sure recipe for upstream grumpiness.
> 
> That's a fair enough point, and I do agree with you on it.
> 
> Right now, it is good for very little.  An app has to basically be
> either specifically designed to work, or be pretty puny in its
> capabilities.  Any fds that are open can only be restored if a simple
> open();lseek(); would have been sufficient to get it back into a good
> state.  The process must be single-threaded.  Shared memory, hugetlbfs,
> VM_NONLINEAR are not supported.  
That is OK as a starting point, as long as:
> > For example, one of the critical corner points: can an app programmatically 
> > determine whether it can support checkpoint/restart safely? Are there 
> > warnings/signals/helpers in place that make it a well-defined space, and
> > make the implementation of missing features directly actionable?
> > 
> > ( instead of: 'silent breakage' and a wishy-washy boundary between the
> >   working and non-working space. Without clear boundaries there's no
> >   clear dynamics that extends the 'working' space beyond the demo stage. )
> 
> Patch 12/14 is supposed to address this *concept*.  But, it hasn't been
> carried through so that it currently works.  My expectation was that we
> would go through and add things over time.  I'll go make sure I push it
> to the point that it actually works for at least the simple test
> programs that we have.
> 
> What I will probably do is something BKL-style.  Basically put a "this
> can't be checkpointed" marker over most everything I can think of and
> selectively remove it as we add features.  
An app really has to know whether it can reliably checkpoint+restart.
Otherwise it wont ever get past the toy stage and people will waste a
lot of time if their designed-for-checkpoints app accidentally runs
into some kernel feature or other side-effect that is not supported.
I personally wouldnt mind sprinkling the kernel with markers, as long
as you can make it really cheap even with CONFIG_CHECKPOINT_RESTART=y.
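
A minimal model of such a marker, assuming a per-task field that
records the first unsupported feature the task touched.  All names
(struct task_model, ckpt_deny(), ckpt_allowed(), ckpt_reason()) are
hypothetical and are not the ones used in patch 12/14; the point is
only that the marker can be a single compare-and-store, and that the
answer to "can this be checkpointed, and if not, why not" is available
before a checkpoint is attempted.

```c
#include <stdbool.h>

/* Hypothetical reasons a task stops being checkpointable. */
enum ckpt_block_reason {
	CKPT_BLOCK_NONE = 0,
	CKPT_BLOCK_SOCKET,
	CKPT_BLOCK_THREADS,
	CKPT_BLOCK_SHMEM,
};

/* Stand-in for the per-task state a kernel would carry. */
struct task_model {
	enum ckpt_block_reason blocked_by;  /* first unsupported feature */
};

/* Marker placed in every code path that checkpoint cannot handle yet. */
static inline void ckpt_deny(struct task_model *t,
			     enum ckpt_block_reason why)
{
	if (t->blocked_by == CKPT_BLOCK_NONE)
		t->blocked_by = why;
}

/* What an application (or the checkpoint path itself) would query. */
static inline bool ckpt_allowed(const struct task_model *t)
{
	return t->blocked_by == CKPT_BLOCK_NONE;
}

/* Human-readable reason, so a refusal is actionable rather than silent. */
const char *ckpt_reason(enum ckpt_block_reason why)
{
	switch (why) {
	case CKPT_BLOCK_NONE:    return "checkpointable";
	case CKPT_BLOCK_SOCKET:  return "open socket";
	case CKPT_BLOCK_THREADS: return "multiple threads";
	case CKPT_BLOCK_SHMEM:   return "shared memory";
	}
	return "unknown";
}
```

Removing a marker then amounts to deleting one ckpt_deny() call once
the corresponding feature is supported.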
Btw., i dont think it's all that much work, nor is it really intrusive:
have you thought of reusing all the existing security callbacks? You'd
have instant coverage of basically every system call and kernel
functionality that matters, and you could have a finegrained set of
policies.
The only drawback is that you have to enable CONFIG_SECURITY for it,
but in practice most distros enable that, so the callback overhead is
already there - you just have to enable it. (Also, some care has to
be taken to properly stack it to existing LSM modules, but that is
solvable too.)
Sidenote: CONFIG_CHECKPOINT_RESTART is IMO an uncomfortably long name,
i'd suggest renaming it to CONFIG_CHECKPOINTS or so. [the concept of a
checkpoint is good enough to mention - if there's a checkpoint then a
restart is logically implied.]
	Ingo
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-12 21:51           ` What can OpenVZ do? Dave Hansen
  2009-02-12 22:10             ` Andrew Morton
  2009-02-12 22:17             ` What can OpenVZ do? Alexey Dobriyan
@ 2009-02-13 10:27             ` Ingo Molnar
  2009-02-13 11:32               ` Alexey Dobriyan
  2 siblings, 1 reply; 121+ messages in thread
From: Ingo Molnar @ 2009-02-13 10:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, Matt Mackall, containers, hpa, linux-kernel,
	linux-mm, viro, linux-api, torvalds, tglx, Pavel Emelyanov
* Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> > If so, perhaps that can be used as a guide.  Will the planned feature
> > have a similar design?  If not, how will it differ?  To what extent can
> > we use that implementation as a tool for understanding what this new
> > implementation will look like?
> 
> Yes, we can certainly use it as a guide.  However, there are some
> barriers to being able to do that:
> 
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
>  628 files changed, 59597 insertions(+), 2927 deletions(-)
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
>   84887  290855 2308745
> 
> Unfortunately, the git tree doesn't have that great of a history.  It
> appears that the forward-ports are just applications of huge single
> patches which then get committed into git.  This tree has also
> historically contained a bunch of stuff not directly related to
> checkpoint/restart like resource management.
Really, OpenVZ/Virtuozzo does not seem to have enough incentive to merge
upstream, they only seem to forward-port, keep their tree messy, do minimal
work to reduce the cross section to the rest of the kernel (so that they can
manage the forward ports) but otherwise are happy with their carved-out
niche market. [which niche is also spiced with some proprietary add-ons,
last i checked, not exactly the contribution environment that breeds a
healthy flow of patches towards the upstream kernel.]
Merging checkpoints instead might give them the incentive to get
their act together.
	Ingo
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-12 22:10             ` Andrew Morton
  2009-02-12 23:04               ` How much of a mess does OpenVZ make? ;) Was: " Dave Hansen
@ 2009-02-13 10:53               ` Ingo Molnar
       [not found]                 ` <20090213105302.GC4608-X9Un+BFzKDI@public.gmane.org>
  1 sibling, 1 reply; 121+ messages in thread
From: Ingo Molnar @ 2009-02-13 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Hansen, mpm, containers, hpa, linux-kernel, linux-mm, viro,
	linux-api, torvalds, tglx, xemul
* Andrew Morton <akpm@linux-foundation.org> wrote:
> Now, we've gone in blind before - most notably on the
> containers/cgroups/namespaces stuff.  That hail mary pass worked out
> acceptably, I think.  Maybe we got lucky.  I thought that
> net-namespaces in particular would never get there, but it did.
> 
> That was a very large and quite long-term-important user-visible
> feature.
> 
> checkpoint/restart/migration is also a long-term-...-feature.  But if
> at all possible I do think that we should go into it with our eyes a
> little less shut.
IMO, s/.../important/
More important than containers in fact. Being able to detach all
software state from the hw state and being able to reattach it:
   1) at a later point in time,                   or
   2) in a different piece of hardware,           or
   3) [future] in a different kernel
... is powerful stuff on a very conceptual level IMO.
The reason we dont have it in every OS is not that it's unwanted,
but that it's very, very hard to do it on a wide scale.  But people
would love it even if it adds (some) overhead.
This kind of featureset is actually the main motivator for virtualization.
If the native kernel was able to do checkpointing we'd have not only
near-zero-cost virtualization done at the right abstraction level
(when combined with containers/control-groups), but we'd also have
a few future feature items like:
  1) Kernel upgrades done intelligently: transparent reboot into an
     upgraded kernel.
  2) Downgrade-on-regressions done sanely: transparent downgrade+reboot
     to a known-working kernel. (as long as the regression is app
     misbehavior or a performance problem - not a kernel crash. Most
     regressions on kernel upgrades are not actual crashes or data
     corruption but functional and performance regressions - i.e. it's
     safely checkpointable and downgradeable.)
  3) Hibernation done intelligently: checkpoint everything, turn off
     system. Turn on system, restore everything from the checkpoint.
  4) Backups done intelligently: full "backups" of long-running
     computational jobs, maybe even of complex things like databases
     or desktop sessions.
  5) Remote debugging done intelligently: got a crashed session?
     Checkpoint the whole app in its anomalous state and upload the
     image (as long as you can trust the developer with that image
     and with the filesystem state that goes with it).
I dont see many long-term dragons here. The kernel is obviously always
able to do near-zero-overhead checkpointing: it knows about all its
own data structures, can enumerate them and knows how they map to
user-space objects.
The rest is performance considerations: do we want to embed
checkpointing helpers in certain runtime codepaths, to make
checkpointing faster? But if that is undesirable (serialization,
etc.), we can always fall back to the dumbest, zero-overhead methods.
There is _one_ interim runtime cost: the "can we checkpoint or not"
decision that the kernel has to make while the feature is not complete.
That, if this feature takes off, is just a short-term worry - as
basically everything will be checkpointable in the long run.
In any case, by designing checkpointing to reuse the existing LSM
callbacks, we'd hit multiple birds with the same stone. (One of
which is the constant complaints about the runtime costs of the LSM
callbacks - with checkpointing we get an independent, non-security
user of the facility which is a nice touch.)
So all things considered it does not look like a bad deal to me - but
i might be missing something nasty.
	Ingo
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-13 10:27             ` Ingo Molnar
@ 2009-02-13 11:32               ` Alexey Dobriyan
  2009-02-13 11:45                 ` Ingo Molnar
  0 siblings, 1 reply; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-13 11:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, Andrew Morton, Matt Mackall, containers, hpa,
	linux-kernel, linux-mm, viro, linux-api, torvalds, tglx,
	Pavel Emelyanov
On Fri, Feb 13, 2009 at 11:27:32AM +0100, Ingo Molnar wrote:
> 
> * Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> 
> > > If so, perhaps that can be used as a guide.  Will the planned feature
> > > have a similar design?  If not, how will it differ?  To what extent can
> > > we use that implementation as a tool for understanding what this new
> > > implementation will look like?
> > 
> > Yes, we can certainly use it as a guide.  However, there are some
> > barriers to being able to do that:
> > 
> > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
> >  628 files changed, 59597 insertions(+), 2927 deletions(-)
> > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
> >   84887  290855 2308745
> > 
> > Unfortunately, the git tree doesn't have that great of a history.  It
> > appears that the forward-ports are just applications of huge single
> > patches which then get committed into git.  This tree has also
> > historically contained a bunch of stuff not directly related to
> > checkpoint/restart like resource management.
> 
> Really, OpenVZ/Virtuozzo does not seem to have enough incentive to merge
> upstream, they only seem to forward-port, keep their tree messy, do minimal
> work to reduce the cross section to the rest of the kernel (so that they can
> manage the forward ports) but otherwise are happy with their carved-out
> niche market. [which niche is also spiced with some proprietary add-ons,
> last i checked, not exactly the contribution environment that breeds a
> healthy flow of patches towards the upstream kernel.]
Oh, cut the crap!
> Merging checkpoints instead might give them the incentive to get
> their act together.
Knowing how much time it takes to beat CPT back into usable shape every time
big kernel rebase is done, OpenVZ/Virtuozzo have every single damn incentive
to have CPT mainlined.
If someone is afraid of long config options, there are always CONFIG_CPT and
CONFIG_CR available.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-13 11:32               ` Alexey Dobriyan
@ 2009-02-13 11:45                 ` Ingo Molnar
  2009-02-13 22:28                   ` Alexey Dobriyan
  0 siblings, 1 reply; 121+ messages in thread
From: Ingo Molnar @ 2009-02-13 11:45 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Dave Hansen, Andrew Morton, Matt Mackall, containers, hpa,
	linux-kernel, linux-mm, viro, linux-api, torvalds, tglx,
	Pavel Emelyanov
* Alexey Dobriyan <adobriyan@gmail.com> wrote:
> On Fri, Feb 13, 2009 at 11:27:32AM +0100, Ingo Molnar wrote:
> > 
> > * Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> > 
> > > > If so, perhaps that can be used as a guide.  Will the planned feature
> > > > have a similar design?  If not, how will it differ?  To what extent can
> > > > we use that implementation as a tool for understanding what this new
> > > > implementation will look like?
> > > 
> > > Yes, we can certainly use it as a guide.  However, there are some
> > > barriers to being able to do that:
> > > 
> > > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
> > >  628 files changed, 59597 insertions(+), 2927 deletions(-)
> > > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
> > >   84887  290855 2308745
> > > 
> > > Unfortunately, the git tree doesn't have that great of a history.  It
> > > appears that the forward-ports are just applications of huge single
> > > patches which then get committed into git.  This tree has also
> > > historically contained a bunch of stuff not directly related to
> > > checkpoint/restart like resource management.
> > 
> > Really, OpenVZ/Virtuozzo does not seem to have enough incentive to merge
> > upstream, they only seem to forward-port, keep their tree messy, do minimal
> > work to reduce the cross section to the rest of the kernel (so that they can
> > manage the forward ports) but otherwise are happy with their carved-out
> > niche market. [which niche is also spiced with some proprietary add-ons,
> > last i checked, not exactly the contribution environment that breeds a
> > healthy flow of patches towards the upstream kernel.]
> 
> Oh, cut the crap!
> 
> > Merging checkpoints instead might give them the incentive to get
> > their act together.
> 
> Knowing how much time it takes to beat CPT back into usable shape every time
> big kernel rebase is done, OpenVZ/Virtuozzo have every single damn incentive
> to have CPT mainlined.
So where is the bottleneck? I suspect the effort in having forward ported
it across 4 major kernel releases in a single year is already larger than
the technical effort it would take to upstream it. Any unreasonable upstream
resistance/passivity you are bumping into?
	Ingo
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-13 11:45                 ` Ingo Molnar
@ 2009-02-13 22:28                   ` Alexey Dobriyan
  2009-03-14  0:04                     ` Eric W. Biederman
  0 siblings, 1 reply; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-13 22:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, Andrew Morton, Matt Mackall, containers, hpa,
	linux-kernel, linux-mm, viro, linux-api, torvalds, tglx,
	Pavel Emelyanov
On Fri, Feb 13, 2009 at 12:45:03PM +0100, Ingo Molnar wrote:
> 
> * Alexey Dobriyan <adobriyan@gmail.com> wrote:
> 
> > On Fri, Feb 13, 2009 at 11:27:32AM +0100, Ingo Molnar wrote:
> > > Merging checkpoints instead might give them the incentive to get
> > > their act together.
> > 
> > Knowing how much time it takes to beat CPT back into usable shape every time
> > big kernel rebase is done, OpenVZ/Virtuozzo have every single damn incentive
> > to have CPT mainlined.
> 
> So where is the bottleneck? I suspect the effort in having forward ported
> it across 4 major kernel releases in a single year is already larger than
> the technical effort it would take to upstream it. Any unreasonable upstream
> resistance/passivity you are bumping into?
People were busy with netns/containers stuff and OpenVZ/Virtuozzo bugs.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-12 18:11     ` Dave Hansen
  2009-02-12 19:30       ` Matt Mackall
@ 2009-02-13 23:28       ` Andrew Morton
  2009-02-14 23:08         ` Ingo Molnar
                           ` (2 more replies)
  1 sibling, 3 replies; 121+ messages in thread
From: Andrew Morton @ 2009-02-13 23:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: mingo, orenl, linux-api, containers, linux-kernel, linux-mm,
	torvalds, viro, hpa, tglx
On Thu, 12 Feb 2009 10:11:22 -0800
Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> 
> ...
>
> > - In bullet-point form, what features are missing, and should be added?
> 
>  * support for more architectures than i386
>  * file descriptors:
>   * sockets (network, AF_UNIX, etc...)
>   * devices files
>   * shmfs, hugetlbfs
>   * epoll
>   * unlinked files
>  * Filesystem state
>   * contents of files
>   * mount tree for individual processes
>  * flock
>  * threads and sessions
>  * CPU and NUMA affinity
>  * sys_remap_file_pages()
> 
> This is a very minimal list that is surely incomplete and sure to grow.
That's a worry.
> 
> > For extra marks:
> > 
> > - Will any of this involve non-trivial serialisation of kernel
> >   objects?  If so, that's getting into the
> >   unacceptably-expensive-to-maintain space, I suspect.
> 
> We have some structures that are certainly tied to the kernel-internal
> ones.  However, we are certainly *not* simply writing kernel structures
> to userspace.  We could do that with /dev/mem.  We are carefully pulling
> out the minimal bits of information from the kernel structures that we
> *need* to recreate the function of the structure at restart.  There is a
> maintenance burden here but, so far, that burden is almost entirely in
> checkpoint/*.c.  We intend to test this functionality thoroughly to
> ensure that we don't regress once we have integrated it.
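
A minimal sketch of the distinction drawn above between dumping a
kernel structure and extracting only what restart needs, assuming a
memory-region-like object.  Both structures and region_to_image() are
hypothetical stand-ins, not the record formats of the patch set.  The
image record holds fixed-width fields and no pointers, so it does not
change when the in-kernel layout does.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a kernel-internal object: it carries pointers and links. */
struct region_internal {
	unsigned long start, end;
	unsigned long flags;
	struct region_internal *next;  /* meaningless outside this boot */
	void *backing_file;            /* likewise */
};

/* Hypothetical image record: fixed-width fields only, no pointers. */
struct region_image {
	uint64_t start;
	uint64_t end;
	uint32_t flags;
	uint32_t file_ref;  /* index into a table of saved files, 0 = none */
};

/*
 * Copy only the fields needed to recreate the region at restart.
 * The caller supplies file_ref after saving the backing file separately.
 */
void region_to_image(const struct region_internal *r, uint32_t file_ref,
		     struct region_image *out)
{
	memset(out, 0, sizeof(*out));
	out->start = (uint64_t)r->start;
	out->end = (uint64_t)r->end;
	out->flags = (uint32_t)r->flags;
	out->file_ref = file_ref;
}
```

The maintenance cost then sits in functions like region_to_image(),
which have to be revisited whenever the internal structure gains state
that matters at restart.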
I guess my question can be approximately simplified to: "will it end up
looking like openvz"?  (I don't believe that we know of any other way
of implementing this?)
Because if it does then that's a concern, because my assessment when I
looked at that code (a number of years ago) was that having code of
that nature in mainline would be pretty costly to us, and rather
unwelcome.
The broadest form of the question is "will we end up regretting having
done this".
If we can arrange for the implementation to sit quietly over in a
corner with a team of people maintaining it and not screwing up other
people's work then I guess we'd be OK - if it breaks then the breakage
is localised.
And it's not just a matter of "does the diffstat only affect a single
subdirectory".  We also should watch out for the imposition of new
rules which kernel code must follow.  "you can't do that, because we
can't serialise it", or something.
Similar to the way in which perfectly correct and normal kernel code
sometimes has to be changed because it unexpectedly upsets the -rt
patch.
Do you expect that any restrictions of this type will be imposed?
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-13 23:28       ` Andrew Morton
@ 2009-02-14 23:08         ` Ingo Molnar
  2009-02-14 23:31           ` Andrew Morton
       [not found]         ` <20090213152836.0fbbfa7d.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2009-03-13  2:45         ` Oren Laadan
  2 siblings, 1 reply; 121+ messages in thread
From: Ingo Molnar @ 2009-02-14 23:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Hansen, orenl, linux-api, containers, linux-kernel, linux-mm,
	torvalds, viro, hpa, tglx
* Andrew Morton <akpm@linux-foundation.org> wrote:
> Similar to the way in which perfectly correct and normal kernel code
> sometimes has to be changed because it unexpectedly upsets the -rt
> patch.
Actually, regarding -rt, we try to keep that in two buckets:
 1) Normal kernel code works but is unclean or structured less
    than ideal. In this case we restructure the mainline code,
    but that change stands on its own four legs, without any
    -rt considerations.
 2) Normal kernel code that is clean - i.e. a change that only
    matters to -rt. In this case we dont touch the mainline code,
    nor do we bother mainline.
Do you know any specific example that falls outside of those categories?
	Ingo
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-14 23:08         ` Ingo Molnar
@ 2009-02-14 23:31           ` Andrew Morton
  2009-02-14 23:50             ` Ingo Molnar
  0 siblings, 1 reply; 121+ messages in thread
From: Andrew Morton @ 2009-02-14 23:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, orenl, linux-api, containers, linux-kernel, linux-mm,
	torvalds, viro, hpa, tglx
On Sun, 15 Feb 2009 00:08:02 +0100 Ingo Molnar <mingo@elte.hu> wrote:
> 
> * Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > Similar to the way in which perfectly correct and normal kernel code
> > sometimes has to be changed because it unexpectedly upsets the -rt
> > patch.
> 
> Actually, regarding -rt, we try to keep that in two buckets:
> 
>  1) Normal kernel code works but is unclean or structured less
>     than ideal. In this case we restructure the mainline code,
>     but that change stands on its own four legs, without any
>     -rt considerations.
> 
>  2) Normal kernel code that is clean - i.e. a change that only
>     matters to -rt. In this case we dont touch the mainline code,
>     nor do we bother mainline.
> 
> Do you know any specific example that falls outside of those categories?
> 
It happens fairly regularly.  Problems with irqs-off regions, problems
with preempt_disable() regions (came up just yesterday with a patch from
Jeremy).
Plus some convert-to-sleeping-lock conversions over the years which
weren't obviously needed in mainline.  Or which at least had -rt
motivations.  But that's different.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-14 23:31           ` Andrew Morton
@ 2009-02-14 23:50             ` Ingo Molnar
  0 siblings, 0 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-02-14 23:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Hansen, orenl, linux-api, containers, linux-kernel, linux-mm,
	torvalds, viro, hpa, tglx
* Andrew Morton <akpm@linux-foundation.org> wrote:
> On Sun, 15 Feb 2009 00:08:02 +0100 Ingo Molnar <mingo@elte.hu> wrote:
> 
> > 
> > * Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > Similar to the way in which perfectly correct and normal kernel
> > > sometimes has to be changed because it unexpectedly upsets the -rt
> > > patch.
> > 
> > Actually, regarding -rt, we try to keep that in two buckets:
> > 
> >  1) Normal kernel code works but is unclean or structured less
> >     than ideal. In this case we restructure the mainline code,
> >     but that change stands on its own four legs, without any
> >     -rt considerations.
> > 
> >  2) Normal kernel code that is clean - i.e. a change that only
> >     matters to -rt. In this case we dont touch the mainline code,
> >     nor do we bother mainline.
> > 
> > Do you know any specific example that falls outside of those categories?
> > 
> 
> It happens fairly regularly.  Problems with irqs-off regions, problems
> with preempt_disable() regions (came up just yesterday with a patch from
> Jeremy).
As Peter has stated it in that thread, throwing around preempt_disable()s
is considered anti-social regardless of any -rt concerns. (it's a bit like
how people were throwing around opaque lock_kernel()/unlock_kernel() pairs
a decade ago. It results in poorly documented locking semantics.)
> Plus some convert-to-sleeping-lock conversions over the years which
> weren't obviously needed in mainline.  Or which at least had -rt
> motivations.  But that's different.
Having -rt motivation is perfectly fine - many of the top features we
added in the past 2-3 years originated in the -rt tree. The question 
is, does a change improve the mainline code or not. If it does, the
motivation does not really matter.
I'll also note that recent VFS performance tests with spinning mutexes
have shown that they out-perform both spinlocks, old-semaphores and
old-mutexes. So conversion to sleeping locks might in fact grow a
"because it's not only easier to hack but also faster" dimension as well.
( I'm wondering whether those ext2/ext3 spinlocks that were a performance
  problem when converted to sleeping locks would perform better with
  spinning mutexes. )
	Ingo
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
       [not found]         ` <20090213152836.0fbbfa7d.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2009-02-16 17:37           ` Dave Hansen
  0 siblings, 0 replies; 121+ messages in thread
From: Dave Hansen @ 2009-02-16 17:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo-X9Un+BFzKDI, orenl-eQaUEPhvms7ENvBUuze7eA,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	tglx-hfZtesqFncYOwBW4kG4KsQ
On Fri, 2009-02-13 at 15:28 -0800, Andrew Morton wrote:
> > > For extra marks:
> > > 
> > > - Will any of this involve non-trivial serialisation of kernel
> > >   objects?  If so, that's getting into the
> > >   unacceptably-expensive-to-maintain space, I suspect.
> > 
> > We have some structures that are certainly tied to the kernel-internal
> > ones.  However, we are certainly *not* simply writing kernel structures
> > to userspace.  We could do that with /dev/mem.  We are carefully pulling
> > out the minimal bits of information from the kernel structures that we
> > *need* to recreate the function of the structure at restart.  There is a
> > maintenance burden here but, so far, that burden is almost entirely in
> > checkpoint/*.c.  We intend to test this functionality thoroughly to
> > ensure that we don't regress once we have integrated it.
> 
> I guess my question can be approximately simplified to: "will it end up
> looking like openvz"?  (I don't believe that we know of any other way
> of implementing this?)
> 
> Because if it does then that's a concern, because my assessment when I
> looked at that code (a number of years ago) was that having code of
> that nature in mainline would be pretty costly to us, and rather
> unwelcome.
With the current path, my guess is that we will end up looking
*something* like OpenVZ.  But, with all the input from the OpenVZ folks
and at least three other projects, I bet we can come up with something
better.  I do wish the OpenVZ folks were being more vocal and
constructive about Oren's current code but I guess silence is the
greatest compliment...
> The broadest form of the question is "will we end up regretting having
> done this".
> If we can arrange for the implementation to sit quietly over in a
> corner with a team of people maintaining it and not screwing up other
> people's work then I guess we'd be OK - if it breaks then the breakage
> is localised.
> 
> And it's not just a matter of "does the diffstat only affect a single
> subdirectory".  We also should watch out for the imposition of new
> rules which kernel code must follow.  "you can't do that, because we
> can't serialise it", or something.
> 
> Similar to the way in which perfectly correct and normal kernel
> sometimes has to be changed because it unexpectedly upsets the -rt
> patch.
> 
> Do you expect that any restrictions of this type will be imposed?
Basically, yes.  But, practically, we haven't been thinking about
serializing stuff in the kernel, ever.  That's produced a few
difficult-to-serialize things like AF_UNIX sockets but absolutely
nothing that simply can't be done.  
Having this code in mainline and getting some of people's mindshare
should at least enable us to speak up if we see another thing like
AF_UNIX coming down the pipe.  We could hopefully catch it and at least
tweak it a bit to enhance how easily we can serialize it. 
Again, it isn't likely to be an all-or-nothing situation.  It is a
matter of how many hoops the checkpoint code itself has to jump
through. 
-- Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
       [not found]                 ` <20090213105302.GC4608-X9Un+BFzKDI@public.gmane.org>
@ 2009-02-16 20:51                   ` Dave Hansen
  2009-02-17 22:23                     ` Ingo Molnar
  0 siblings, 1 reply; 121+ messages in thread
From: Dave Hansen @ 2009-02-16 20:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	tglx-hfZtesqFncYOwBW4kG4KsQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	xemul-GEFAQzZX7r8dnm+yROfE0A, Nathan Lynch
On Fri, 2009-02-13 at 11:53 +0100, Ingo Molnar wrote:
> In any case, by designing checkpointing to reuse the existing LSM
> callbacks, we'd hit multiple birds with the same stone. (One of
> which is the constant complaints about the runtime costs of the LSM
> callbacks - with checkpointing we get an independent, non-security
> user of the facility which is a nice touch.)
There's a fundamental problem with using LSM that I'm seeing now that I
look at using it for file descriptors.  The LSM hooks are there to say,
"No, you can't do this" and abort whatever kernel operation was going
on.  That's good for detecting when we do something that's "bad" for
checkpointing.
*But* it completely falls on its face when we want to find out when we
are doing things that are *good*.  For instance, let's say that we open
a network socket.  The LSM hook sees it and marks us as
uncheckpointable.  What about when we close it?  We've become
checkpointable again.  But, there's no LSM hook for the close side
because we don't currently have a need for it.
We have a couple of options:

We can let uncheckpointable actions behave like security violations and
just abort the kernel calls.  The problem with this is that it makes it
difficult to do *anything* unless your application is 100% supported.
Pretty inconvenient, especially at first.  Might be useful later on
though.

We could just log the actions and let them proceed.  But the problem
with this is that we don't get the temporal idea of when an app
transitions between the "good" and "bad" states.  We would need to work
on culling the output in the logs since we'd be potentially getting a
lot of redundant data.

We could add to the set of security hooks.  Make sure that we cover all
the transitional states like close().

What I'm thinking about doing for now is what I have attached here.  We
allow the apps that we want to checkpoint to query an interface that
uses the same checks that sys_checkpoint() does internally.
Say:
# cat /proc/1072/checkpointable
mm: 1
files: 0
...
Then, when it realizes that its files can't be checkpointed, it can look
elsewhere:
/proc/1072/fdinfo/2:pos:	0
/proc/1072/fdinfo/2:flags:	02
/proc/1072/fdinfo/2:checkpointable: 0 (special file)
/proc/1072/fdinfo/3:pos:	0
/proc/1072/fdinfo/3:flags:	04000
/proc/1072/fdinfo/3:checkpointable: 0 (pipefs does not support checkpoint)
/proc/1072/fdinfo/4:pos:	0
/proc/1072/fdinfo/4:flags:	04002
/proc/1072/fdinfo/4:checkpointable: 0 (sockfs does not support checkpoint)
/proc/1074/fdinfo/0:pos:	0
/proc/1074/fdinfo/0:flags:	0100002
/proc/1074/fdinfo/0:checkpointable: 0 (devpts does not support checkpoint)
That requires zero overhead during runtime of the app.  It is also less
error-prone because we don't have any of the transitions to catch.
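As an illustration of how a tool might consume the output shown above,
here is a minimal user-space sketch.  It assumes the "checkpointable:"
line has exactly the format printed in this mail (a 0 or 1, optionally
followed by a reason in parentheses); the function name
fdinfo_checkpointable() is made up for this example and is not part of
the posted patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan the text of one /proc/<pid>/fdinfo/<fd> file for the proposed
 * "checkpointable:" line.  Returns 1 or 0 as reported, or -1 if the
 * line is missing or malformed.  If a reason is given in parentheses,
 * it is copied (truncated if necessary) into 'reason'.
 */
static int fdinfo_checkpointable(const char *text, char *reason, size_t len)
{
	static const char key[] = "checkpointable:";
	const size_t keylen = sizeof(key) - 1;
	const char *line = text;

	if (reason && len)
		reason[0] = '\0';

	while (line && *line) {
		if (strncmp(line, key, keylen) == 0) {
			const char *eol = strchr(line, '\n');
			const char *lparen = strchr(line, '(');
			const char *rparen = lparen ? strchr(lparen, ')') : NULL;
			int value;

			if (sscanf(line + keylen, "%d", &value) != 1)
				return -1;
			/* Copy the reason only if both parentheses are on this line. */
			if (reason && len && rparen && (!eol || rparen < eol)) {
				size_t n = (size_t)(rparen - lparen - 1);

				if (n >= len)
					n = len - 1;
				memcpy(reason, lparen + 1, n);
				reason[n] = '\0';
			}
			return value != 0;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A caller would read each fdinfo file into a buffer and pass it in; a
return of 0 together with the reason string tells the app which
descriptor is blocking a checkpoint.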
-- Dave
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
index e3097ac..ebe776a 100644
--- a/checkpoint/ckpt_file.c
+++ b/checkpoint/ckpt_file.c
@@ -72,6 +72,32 @@ int cr_scan_fds(struct files_struct *files, int **fdtable)
 	return n;
 }
 
+int cr_can_checkpoint_file(struct file *file, char *explain, int left)
+{
+	char p[] = "checkpointable";
+	struct inode *inode = file->f_dentry->d_inode;
+	struct file_system_type *fs_type = inode->i_sb->s_type;
+
+	printk("%s() left: %d\n", __func__, left);
+
+	if (!(fs_type->fs_flags & FS_CHECKPOINTABLE)) {
+		if (explain)
+			snprintf(explain, left,
+				"%s: 0 (%s does not support checkpoint)\n",
+				p, fs_type->name);
+		return 0;
+	}
+
+	if (special_file(inode->i_mode)) {
+		if (explain)
+			snprintf(explain, left,	"%s: 0 (special file)\n", p);
+		return 0;
+	}
+
+	snprintf(explain, left, "%s: 1\n", p);
+	return 1;
+}
+
 /* cr_write_fd_data - dump the state of a given file pointer */
 static int cr_write_fd_data(struct cr_ctx *ctx, struct file *file, int parent)
 {
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d467760..2300353 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1597,7 +1597,19 @@ out:
 	return ~0U;
 }
 
-#define PROC_FDINFO_MAX 64
+#define PROC_FDINFO_MAX PAGE_SIZE
+
+static void proc_fd_write_info(struct file *file, char *info)
+{
+	int max = PROC_FDINFO_MAX;
+	int p = 0;
+	if (!info)
+		return;
+
+	p += snprintf(info+p, max-p, "pos:\t%lli\n", (long long) file->f_pos);
+	p += snprintf(info+p, max-p, "flags:\t0%o\n", file->f_flags);
+	cr_can_checkpoint_file(file, info+p, max-p);
+}
 
 static int proc_fd_info(struct inode *inode, struct path *path, char *info)
 {
@@ -1622,12 +1634,7 @@ static int proc_fd_info(struct inode *inode, struct path *path, char *info)
 				*path = file->f_path;
 				path_get(&file->f_path);
 			}
-			if (info)
-				snprintf(info, PROC_FDINFO_MAX,
-					 "pos:\t%lli\n"
-					 "flags:\t0%o\n",
-					 (long long) file->f_pos,
-					 file->f_flags);
+			proc_fd_write_info(file, info);
 			spin_unlock(&files->file_lock);
 			put_files_struct(files);
 			return 0;
@@ -1831,10 +1838,11 @@ static int proc_readfd(struct file *filp, void *dirent, filldir_t filldir)
 static ssize_t proc_fdinfo_read(struct file *file, char __user *buf,
 				      size_t len, loff_t *ppos)
 {
-	char tmp[PROC_FDINFO_MAX];
+	char *tmp = kmalloc(PROC_FDINFO_MAX, GFP_KERNEL);
 	int err = proc_fd_info(file->f_path.dentry->d_inode, NULL, tmp);
 	if (!err)
 		err = simple_read_from_buffer(buf, len, ppos, tmp, strlen(tmp));
+	kfree(tmp);
 	return err;
 }
 
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 217cf6e..84e69b0 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -142,11 +142,17 @@ static inline void __task_deny_checkpointing(struct task_struct *task,
 #define task_deny_checkpointing(p)  \
 	__task_deny_checkpointing(p, __FILE__, __LINE__)
 
+int cr_can_checkpoint_file(struct file *file, char *explain, int left);
+
 #else
 
 static inline void task_deny_checkpointing(struct task_struct *task) {}
 static inline void process_deny_checkpointing(struct task_struct *task) {}
 
-#endif
+static inline int cr_can_checkpoint_file(struct file *file, char *explain, int left)
+{
+	return 0;
+}
 
+#endif
 #endif /* _CHECKPOINT_CKPT_H_ */
^ permalink raw reply related	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-16 20:51                   ` Dave Hansen
@ 2009-02-17 22:23                     ` Ingo Molnar
       [not found]                       ` <20090217222319.GA10546-X9Un+BFzKDI@public.gmane.org>
  0 siblings, 1 reply; 121+ messages in thread
From: Ingo Molnar @ 2009-02-17 22:23 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, linux-api, containers, hpa, linux-kernel, linux-mm,
	viro, mpm, tglx, torvalds, xemul, Nathan Lynch
* Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> On Fri, 2009-02-13 at 11:53 +0100, Ingo Molnar wrote:
> > In any case, by designing checkpointing to reuse the existing LSM
> > callbacks, we'd hit multiple birds with the same stone. (One of
> > which is the constant complaints about the runtime costs of the LSM
> > callbacks - with checkpointing we get an independent, non-security
> > user of the facility which is a nice touch.)
> 
> There's a fundamental problem with using LSM that I'm seeing 
> now that I look at using it for file descriptors.  The LSM 
> hooks are there to say, "No, you can't do this" and abort 
> whatever kernel operation was going on.  That's good for 
> detecting when we do something that's "bad" for checkpointing.
> 
> *But* it completely falls on its face when we want to find out 
> when we are doing things that are *good*.  For instance, let's 
> say that we open a network socket.  The LSM hook sees it and 
> marks us as uncheckpointable.  What about when we close it?  
> We've become checkpointable again.  But, there's no LSM hook 
> for the close side because we don't currently have a need for 
> it.
Uncheckpointable should be a one-way flag anyway. We want this 
to become usable, so uncheckpointable functionality should be as 
painful as possible, to make sure it's getting fixed ...
> We have a couple of options:
> 
> We can let uncheckpointable actions behave like security 
> violations and just abort the kernel calls.  The problem with 
> this is that it makes it difficult to do *anything* unless 
> your application is 100% supported. Pretty inconvenient, 
> especially at first.  Might be useful later on though.
It still beats "no checkpointing support at all in the upstream
kernel" by a wide margin. If an app fails, that is all the more
reason to bring checkpointing support up to production quality. We
don't want to make the 'interim' state _too_ convenient, because it
will quickly turn into the status quo.
Really, the LSM approach seems to be the right approach here. It 
keeps maintenance costs very low - there's no widespread 
BKL-style flaggery.
	Ingo
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
       [not found]                       ` <20090217222319.GA10546-X9Un+BFzKDI@public.gmane.org>
@ 2009-02-17 22:30                         ` Dave Hansen
  2009-02-18  0:32                           ` Ingo Molnar
  0 siblings, 1 reply; 121+ messages in thread
From: Dave Hansen @ 2009-02-17 22:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	tglx-hfZtesqFncYOwBW4kG4KsQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	xemul-GEFAQzZX7r8dnm+yROfE0A, Nathan Lynch
On Tue, 2009-02-17 at 23:23 +0100, Ingo Molnar wrote:
> * Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> > On Fri, 2009-02-13 at 11:53 +0100, Ingo Molnar wrote:
> > > In any case, by designing checkpointing to reuse the existing LSM
> > > callbacks, we'd hit multiple birds with the same stone. (One of
> > > which is the constant complaints about the runtime costs of the LSM
> > > callbacks - with checkpointing we get an independent, non-security
> > > user of the facility which is a nice touch.)
> > 
> > There's a fundamental problem with using LSM that I'm seeing 
> > now that I look at using it for file descriptors.  The LSM 
> > hooks are there to say, "No, you can't do this" and abort 
> > whatever kernel operation was going on.  That's good for 
> > detecting when we do something that's "bad" for checkpointing.
> > 
> > *But* it completely falls on its face when we want to find out 
> > when we are doing things that are *good*.  For instance, let's 
> > say that we open a network socket.  The LSM hook sees it and 
> > marks us as uncheckpointable.  What about when we close it?  
> > We've become checkpointable again.  But, there's no LSM hook 
> > for the close side because we don't currently have a need for 
> > it.
> 
> Uncheckpointable should be a one-way flag anyway. We want this 
> to become usable, so uncheckpointable functionality should be as 
> painful as possible, to make sure it's getting fixed ...
Again, as these patches stand, we don't support checkpointing when
non-simple files are opened.  Basically, if an open()/lseek() pair won't
get you back where you were, we don't deal with them.
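To make the open()/lseek() criterion concrete, here is a small
user-space sketch.  It is only an analogy to what the kernel patches
do: struct saved_file, save_file() and restore_file() are hypothetical
names for this example.  A descriptor counts as "simple" here if its
state can be captured as (path, flags, offset) and recreated with
open(2) followed by lseek(2); pipes and sockets fail at the lseek()
step.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record of what would be saved for a "simple" file. */
struct saved_file {
	const char *path;	/* pathname used to reopen the file */
	int flags;		/* open(2) flags, e.g. O_RDONLY */
	off_t pos;		/* file offset at checkpoint time */
};

/*
 * Record the current offset of an open descriptor.
 * Returns 0 on success, -1 if the descriptor is not seekable
 * (pipes and sockets fail here), i.e. it is not a "simple" file.
 */
static int save_file(int fd, const char *path, int flags,
		     struct saved_file *out)
{
	off_t pos = lseek(fd, 0, SEEK_CUR);

	if (pos == (off_t)-1)
		return -1;
	out->path = path;
	out->flags = flags;
	out->pos = pos;
	return 0;
}

/* Reopen the file and seek back.  Returns the new fd, or -1 on error. */
static int restore_file(const struct saved_file *sf)
{
	int fd = open(sf->path, sf->flags);

	if (fd < 0)
		return -1;
	if (lseek(fd, sf->pos, SEEK_SET) == (off_t)-1) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The sketch ignores everything else a real restart has to handle
(descriptor numbers, files unlinked or modified in the meantime, shared
file offsets), which is part of why only simple files are supported at
this stage.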
init does non-checkpointable things.  If the flag is a one-way trip,
we'll never be able to checkpoint because we'll always inherit init's
!checkpointable flag.
To fix this, we could start working on making sure we can checkpoint
init, but that's practically worthless.
-- Dave
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-17 22:30                         ` Dave Hansen
@ 2009-02-18  0:32                           ` Ingo Molnar
  2009-02-18  0:40                             ` Dave Hansen
  0 siblings, 1 reply; 121+ messages in thread
From: Ingo Molnar @ 2009-02-18  0:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, linux-api, containers, hpa, linux-kernel, linux-mm,
	viro, mpm, tglx, torvalds, xemul, Nathan Lynch
* Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> On Tue, 2009-02-17 at 23:23 +0100, Ingo Molnar wrote:
> > * Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> > > On Fri, 2009-02-13 at 11:53 +0100, Ingo Molnar wrote:
> > > > In any case, by designing checkpointing to reuse the existing LSM
> > > > callbacks, we'd hit multiple birds with the same stone. (One of
> > > > which is the constant complaints about the runtime costs of the LSM
> > > > callbacks - with checkpointing we get an independent, non-security
> > > > user of the facility which is a nice touch.)
> > > 
> > > There's a fundamental problem with using LSM that I'm seeing 
> > > now that I look at using it for file descriptors.  The LSM 
> > > hooks are there to say, "No, you can't do this" and abort 
> > > whatever kernel operation was going on.  That's good for 
> > > detecting when we do something that's "bad" for checkpointing.
> > > 
> > > *But* it completely falls on its face when we want to find out 
> > > when we are doing things that are *good*.  For instance, let's 
> > > say that we open a network socket.  The LSM hook sees it and 
> > > marks us as uncheckpointable.  What about when we close it?  
> > > We've become checkpointable again.  But, there's no LSM hook 
> > > for the close side because we don't currently have a need for 
> > > it.
> > 
> > Uncheckpointable should be a one-way flag anyway. We want this 
> > to become usable, so uncheckpointable functionality should be as 
> > painful as possible, to make sure it's getting fixed ...
> 
> Again, as these patches stand, we don't support checkpointing 
> when non-simple files are opened.  Basically, if a 
> open()/lseek() pair won't get you back where you were, we 
> don't deal with them.
> 
> init does non-checkpointable things.  If the flag is a one-way 
> trip, we'll never be able to checkpoint because we'll always 
> inherit init's ! checkpointable flag.
> 
> To fix this, we could start working on making sure we can 
> checkpoint init, but that's practically worthless.
I mean, it should be a per-process (per-app) one-way flag, of
course. If the app does something unsupported, it becomes
non-checkpointable and that's it.
	Ingo
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-18  0:32                           ` Ingo Molnar
@ 2009-02-18  0:40                             ` Dave Hansen
  2009-02-18  5:11                               ` Alexey Dobriyan
  0 siblings, 1 reply; 121+ messages in thread
From: Dave Hansen @ 2009-02-18  0:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, linux-api, containers, hpa, linux-kernel, linux-mm,
	viro, mpm, tglx, torvalds, xemul, Nathan Lynch
On Wed, 2009-02-18 at 01:32 +0100, Ingo Molnar wrote:
> > > Uncheckpointable should be a one-way flag anyway. We want this 
> > > to become usable, so uncheckpointable functionality should be as 
> > > painful as possible, to make sure it's getting fixed ...
> > 
> > Again, as these patches stand, we don't support checkpointing 
> > when non-simple files are opened.  Basically, if a 
> > open()/lseek() pair won't get you back where you were, we 
> > don't deal with them.
> > 
> > init does non-checkpointable things.  If the flag is a one-way 
> > trip, we'll never be able to checkpoint because we'll always 
> > inherit init's ! checkpointable flag.
> > 
> > To fix this, we could start working on making sure we can 
> > checkpoint init, but that's practically worthless.
> 
> i mean, it should be per process (per app) one-way flag of 
> course. If the app does something unsupported, it gets 
> non-checkpointable and that's it.
OK, we can definitely do that.  Do you think it is OK to run through a
set of checks at exec() time to check if the app currently has any
unsupported things going on?  If we don't directly inherit the parent's
status, then we need to have *some* time when we check it.
-- Dave
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-18  0:40                             ` Dave Hansen
@ 2009-02-18  5:11                               ` Alexey Dobriyan
  2009-02-18 18:16                                 ` Ingo Molnar
  0 siblings, 1 reply; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-18  5:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Andrew Morton, linux-api, containers, hpa,
	linux-kernel, linux-mm, viro, mpm, tglx, torvalds, xemul,
	Nathan Lynch
On Tue, Feb 17, 2009 at 04:40:39PM -0800, Dave Hansen wrote:
> On Wed, 2009-02-18 at 01:32 +0100, Ingo Molnar wrote:
> > > > Uncheckpointable should be a one-way flag anyway. We want this 
> > > > to become usable, so uncheckpointable functionality should be as 
> > > > painful as possible, to make sure it's getting fixed ...
> > > 
> > > Again, as these patches stand, we don't support checkpointing 
> > > when non-simple files are opened.  Basically, if a 
> > > open()/lseek() pair won't get you back where you were, we 
> > > don't deal with them.
> > > 
> > > init does non-checkpointable things.  If the flag is a one-way 
> > > trip, we'll never be able to checkpoint because we'll always 
> > > inherit init's ! checkpointable flag.
> > > 
> > > To fix this, we could start working on making sure we can 
> > > checkpoint init, but that's practically worthless.
> > 
> > i mean, it should be per process (per app) one-way flag of 
> > course. If the app does something unsupported, it gets 
> > non-checkpointable and that's it.
> 
> OK, we can definitely do that.  Do you think it is OK to run through a
> set of checks at exec() time to check if the app currently has any
> unsupported things going on?  If we don't directly inherit the parent's
> status, then we need to have *some* time when we check it.
Uncheckpointable is not one-way.
Imagine remap_file_pages(2) is unsupported. Now app uses
remap_file_pages(2), then unmaps interesting VMA. Now app is
checkpointable again.
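The alternative to a sticky flag can be sketched as a check that is
recomputed from current state at query time.  This is only an
illustration under assumed names: struct vma_sketch and
mm_checkpointable() are hypothetical, and the real kernel structures
look nothing like this simplified array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a VMA. */
struct vma_sketch {
	bool mapped;	/* still part of the address space? */
	bool nonlinear;	/* set up via remap_file_pages(2)? */
};

/*
 * Derive checkpointability from the mappings that exist right now,
 * rather than from a record of what the app did in the past.
 */
static bool mm_checkpointable(const struct vma_sketch *vmas, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (vmas[i].mapped && vmas[i].nonlinear)
			return false;
	return true;
}
```

With this shape, unmapping the offending VMA makes the next query
succeed again, at the cost of walking the state on every query.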
As for overloading LSM, I think, it would be horrible.
Most hooks are useless, there are config options expanding LSM hooks,
and CPT and LSM are just totally orthogonal.
Instead, just (no offence) get big enough coverage -- run modern and
past distros, run servers packaged with them, and if you can checkpoint
all of this, you're mostly fine.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-18  5:11                               ` Alexey Dobriyan
@ 2009-02-18 18:16                                 ` Ingo Molnar
       [not found]                                   ` <20090218181644.GD19995-X9Un+BFzKDI@public.gmane.org>
  0 siblings, 1 reply; 121+ messages in thread
From: Ingo Molnar @ 2009-02-18 18:16 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Dave Hansen, Andrew Morton, linux-api, containers, hpa,
	linux-kernel, linux-mm, viro, mpm, tglx, torvalds, xemul,
	Nathan Lynch
* Alexey Dobriyan <adobriyan@gmail.com> wrote:
> On Tue, Feb 17, 2009 at 04:40:39PM -0800, Dave Hansen wrote:
> > On Wed, 2009-02-18 at 01:32 +0100, Ingo Molnar wrote:
> > > > > Uncheckpointable should be a one-way flag anyway. We want this 
> > > > > to become usable, so uncheckpointable functionality should be as 
> > > > > painful as possible, to make sure it's getting fixed ...
> > > > 
> > > > Again, as these patches stand, we don't support checkpointing 
> > > > when non-simple files are opened.  Basically, if a 
> > > > open()/lseek() pair won't get you back where you were, we 
> > > > don't deal with them.
> > > > 
> > > > init does non-checkpointable things.  If the flag is a one-way 
> > > > trip, we'll never be able to checkpoint because we'll always 
> > > > inherit init's ! checkpointable flag.
> > > > 
> > > > To fix this, we could start working on making sure we can 
> > > > checkpoint init, but that's practically worthless.
> > > 
> > > i mean, it should be per process (per app) one-way flag of 
> > > course. If the app does something unsupported, it gets 
> > > non-checkpointable and that's it.
> > 
> > OK, we can definitely do that.  Do you think it is OK to run through a
> > set of checks at exec() time to check if the app currently has any
> > unsupported things going on?  If we don't directly inherit the parent's
> > status, then we need to have *some* time when we check it.
> 
> Uncheckpointable is not one-way.
> 
> Imagine remap_file_pages(2) is unsupported. Now app uses 
> remap_file_pages(2), then unmaps interesting VMA. Now app is 
> checkpointable again.
But that's precisely the kind of over-design that defeats the 
common purpose: which would be to make everything 
checkpointable. (including weirdo APIs like fremap())
Nothing motivates more than app designers complaining about the 
one-way flag.
Furthermore, it's _far_ easier to make a one-way flag SMP-safe.
We just set it and that's it. When we unset it, what do we do about
SMP races with other threads in the same MM installing another
non-linear vma, etc.?
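The SMP argument can be illustrated with a user-space sketch using C11
atomics.  The names struct mm_sketch, mm_deny_checkpointing() and
mm_may_checkpoint() are invented for this example; kernel code would
use its own primitives rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm state discussed here. */
struct mm_sketch {
	atomic_bool uncheckpointable;
};

static void mm_sketch_init(struct mm_sketch *mm)
{
	atomic_init(&mm->uncheckpointable, false);
}

/*
 * Setting the flag is idempotent: any number of threads can do this
 * concurrently, in any order, and the result is the same.
 */
static void mm_deny_checkpointing(struct mm_sketch *mm)
{
	atomic_store(&mm->uncheckpointable, true);
}

static bool mm_may_checkpoint(struct mm_sketch *mm)
{
	return !atomic_load(&mm->uncheckpointable);
}
```

There is deliberately no "allow" function.  Clearing the flag safely
would mean proving that no other thread sharing the MM created a new
unsupported object between the check and the clear, which requires a
lock or a counter covering every such operation.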
> As for overloading LSM, I think, it would be horrible. Most 
> hooks are useless, there are config options expanding LSM 
> hooks, and CPT and LSM are just totally orthogonal.
Sure, it would have to be adapted to the needs of CPT, but I can
tell you one thing for sure: there's only one thing that is
worse than every syscall annotated with an LSM hook (which is 
the current status quo): every syscall annotated with an LSM 
hook _and_ a separate CPT hook.
It's just bad design. CPT might be orthogonal, but it wants to 
hook into syscalls at roughly the same places where LSM hooks 
into, which pretty much settles the question.
If there are places that need new hooks then we can add them not
as CPT hooks, but as security hooks. That way there's synergy:
both LSM and CPT advance, on the shoulders of each other.
> Instead, just (no offence) get big enough coverage -- run 
> modern and past distros, run servers packaged with them, and 
> if you can checkpoint all of this, you're mostly fine.
That's definitely good advice; it just doesn't give the kind of
minimal environment from which productization efforts can be
seeded.
	Ingo
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
       [not found]                                   ` <20090218181644.GD19995-X9Un+BFzKDI@public.gmane.org>
@ 2009-02-18 21:27                                     ` Dave Hansen
  2009-02-18 23:15                                       ` Ingo Molnar
  0 siblings, 1 reply; 121+ messages in thread
From: Dave Hansen @ 2009-02-18 21:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexey Dobriyan, Nathan Lynch, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
On Wed, 2009-02-18 at 19:16 +0100, Ingo Molnar wrote:
> Nothing motivates more than app designers complaining about the 
> one-way flag.
> 
> Furthermore, it's _far_ easier to make a one-way flag SMP-safe. 
> We just set it and that's it. When we unset it, what do we do about
> SMP races with other threads in the same MM installing another 
> non-linear vma, etc.
After looking at this for file descriptors, I have to really agree with
Ingo on this one, at least as far as the flag is concerned.  I want to
propose one teeny change, though:  I think the flag should be
per-resource.
We should have one flag in mm_struct, one in files_struct, etc...  The
task_is_checkpointable() function can just query task->mm, task->files,
etc...  This gives us nice behavior at clone() *and* fork that just
works.
I'll do this for files_struct and see how it comes out so you can take a
peek.
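
A minimal user-space sketch of the per-resource idea above. The types
mm_model, files_model and task_model and the helper
files_deny_checkpoint() are hypothetical stand-ins, not the real kernel
structures or any posted patch; only the name task_is_checkpointable()
comes from the message. The flag is one-way (set, never cleared), and a
task that gets fresh resources simply points at structures whose flag
was never set:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for mm_struct and files_struct. */
struct mm_model    { bool uncheckpointable; };
struct files_model { bool uncheckpointable; };

struct task_model {
	struct mm_model    *mm;    /* shared between threads of one process */
	struct files_model *files; /* shared when the fd table is shared */
};

/* One-way: the flag is only ever set, so there is no unset race. */
static void files_deny_checkpoint(struct files_model *files)
{
	files->uncheckpointable = true;
}

/* A task is checkpointable only if none of its resources is flagged. */
static bool task_is_checkpointable(const struct task_model *t)
{
	if (t->mm && t->mm->uncheckpointable)
		return false;
	if (t->files && t->files->uncheckpointable)
		return false;
	return true;
}
```

In this model every task sharing a flagged files_model becomes
uncheckpointable at once, while a task that was given a new, unflagged
files_model is checkpointable again without any explicit "unset" step.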
-- Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Re: What can OpenVZ do?
  2009-02-18 21:27                                     ` Dave Hansen
@ 2009-02-18 23:15                                       ` Ingo Molnar
  2009-02-19 19:06                                         ` Banning checkpoint (was: Re: What can OpenVZ do?) Alexey Dobriyan
  0 siblings, 1 reply; 121+ messages in thread
From: Ingo Molnar @ 2009-02-18 23:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alexey Dobriyan, Nathan Lynch, linux-api, containers, mpm,
	linux-kernel, linux-mm, viro, hpa, Andrew Morton, torvalds, tglx,
	xemul
* Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> On Wed, 2009-02-18 at 19:16 +0100, Ingo Molnar wrote:
> > Nothing motivates more than app designers complaining about the 
> > one-way flag.
> > 
> > Furthermore, it's _far_ easier to make a one-way flag SMP-safe. 
> > We just set it and that's it. When we unset it, what do we do about
> > SMP races with other threads in the same MM installing another 
> > non-linear vma, etc.
> 
> After looking at this for file descriptors, I have to really 
> agree with Ingo on this one, at least as far as the flag is 
> concerned.  I want to propose one teeny change, though: I 
> think the flag should be per-resource.
> 
> We should have one flag in mm_struct, one in files_struct, 
> etc...  The task_is_checkpointable() function can just query 
> task->mm, task->files, etc...  This gives us nice behavior at 
> clone() *and* fork that just works.
> 
> I'll do this for files_struct and see how it comes out so you 
> can take a peek.
Yeah, per resource it should be. That's per task in the normal 
case - except for threaded workloads where it's shared by 
threads.
	Ingo
* Banning checkpoint (was: Re: What can OpenVZ do?)
  2009-02-18 23:15                                       ` Ingo Molnar
@ 2009-02-19 19:06                                         ` Alexey Dobriyan
  2009-02-19 19:11                                           ` Dave Hansen
  0 siblings, 1 reply; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-19 19:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, Nathan Lynch, linux-api, containers, mpm,
	linux-kernel, linux-mm, viro, hpa, Andrew Morton, torvalds, tglx,
	xemul
I think that all these efforts to abort checkpoint "intelligently" by
banning it early are completely misguided.
The "checkpointable" property isn't a one-way ticket like the "tainted"
flag, so treating it like the tainted variable isn't right, atomic or
not, SMP-safe or not.
With filesystems, one has ->f_op field to compare against banned
filesystems, one more flag isn't necessary.
Inotify isn't supported yet? You do
	if (!list_empty(&inode->inotify_watches))
		return -E;
without hooking into inotify syscalls.
ptrace(2) isn't supported -- look at struct task_struct::ptraced and
friends.
And so on.
System call (or whatever) does something with some piece of kernel
internals. We look at this "something" when walking data structures and
abort if it's scary enough.
Please, show at least one counter-example.
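
A minimal user-space sketch of the "look at the state while walking
data structures" approach described above. The types inode_model and
task_model and the function checkpoint_scan_task() are hypothetical,
and -EBUSY is only a placeholder for the error code the message leaves
unspecified as -E:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects being walked. */
struct inode_model {
	int nr_inotify_watches;   /* models !list_empty(&inode->inotify_watches) */
};

struct task_model {
	int ptraced;                              /* nonzero if ptrace is involved */
	const struct inode_model **open_inodes;   /* inodes behind open files */
	size_t nr_open;
};

/*
 * Called at checkpoint time only: no hooks in the inotify or ptrace
 * syscalls, just an inspection of the state they left behind.
 */
static int checkpoint_scan_task(const struct task_model *t)
{
	size_t i;

	if (t->ptraced)
		return -EBUSY;            /* placeholder error code */

	for (i = 0; i < t->nr_open; i++)
		if (t->open_inodes[i]->nr_inotify_watches > 0)
			return -EBUSY;    /* placeholder error code */

	return 0;
}
```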
* Re: Banning checkpoint (was: Re: What can OpenVZ do?)
  2009-02-19 19:06                                         ` Banning checkpoint (was: Re: What can OpenVZ do?) Alexey Dobriyan
@ 2009-02-19 19:11                                           ` Dave Hansen
  2009-02-24  4:47                                             ` Alexey Dobriyan
  0 siblings, 1 reply; 121+ messages in thread
From: Dave Hansen @ 2009-02-19 19:11 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Ingo Molnar, Nathan Lynch, linux-api, containers, mpm,
	linux-kernel, linux-mm, viro, hpa, Andrew Morton, torvalds, tglx,
	xemul
On Thu, 2009-02-19 at 22:06 +0300, Alexey Dobriyan wrote:
> Inotify isn't supported yet? You do
> 
>         if (!list_empty(&inode->inotify_watches))
>                 return -E;
> 
> without hooking into inotify syscalls.
> 
> ptrace(2) isn't supported -- look at struct task_struct::ptraced and
> friends.
> 
> And so on.
> 
> System call (or whatever) does something with some piece of kernel
> internals. We look at this "something" when walking data structures
> and abort if it's scary enough.
> 
> Please, show at least one counter-example.
Alexey, I agree with you here.  I've been fighting myself internally
about these two somewhat opposing approaches.  Of *course* we can
determine the "checkpointability" at sys_checkpoint() time by checking
all the various bits of state.
The problem that I think Ingo is trying to address here is that doing it
then makes it hard to figure out _when_ you went wrong.  That's the
single most critical piece of finding out how to go address it.
I see where you are coming from.  Ingo's suggestion has the *huge*
downside that we've got to go muck with a lot of generic code and hook
into all the things we don't support.
I think what I posted is a decent compromise.  It gets you those
warnings at runtime and is a one-way trip for any given process.  But,
it does detect in certain cases (fork() and unshare(FILES)) when it is
safe to make the trip back to the "I'm checkpointable" state again.
-- Dave
* Re: Banning checkpoint (was: Re: What can OpenVZ do?)
  2009-02-19 19:11                                           ` Dave Hansen
@ 2009-02-24  4:47                                             ` Alexey Dobriyan
       [not found]                                               ` <20090224044752.GB3202-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
  0 siblings, 1 reply; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-24  4:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Nathan Lynch, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
On Thu, Feb 19, 2009 at 11:11:54AM -0800, Dave Hansen wrote:
> On Thu, 2009-02-19 at 22:06 +0300, Alexey Dobriyan wrote:
> > Inotify isn't supported yet? You do
> > 
> >         if (!list_empty(&inode->inotify_watches))
> >                 return -E;
> > 
> > without hooking into inotify syscalls.
> > 
> > ptrace(2) isn't supported -- look at struct task_struct::ptraced and
> > friends.
> > 
> > And so on.
> > 
> > System call (or whatever) does something with some piece of kernel
> > internals. We look at this "something" when walking data structures
> > and abort if it's scary enough.
> > 
> > Please, show at least one counter-example.
> 
> Alexey, I agree with you here.  I've been fighting myself internally
> about these two somewhat opposing approaches.  Of *course* we can
> determine the "checkpointability" at sys_checkpoint() time by checking
> all the various bits of state.
> 
> The problem that I think Ingo is trying to address here is that doing it
> then makes it hard to figure out _when_ you went wrong.  That's the
> single most critical piece of finding out how to go address it.
> 
> I see where you are coming from.  Ingo's suggestion has the *huge*
> downside that we've got to go muck with a lot of generic code and hook
> into all the things we don't support.
> 
> I think what I posted is a decent compromise.  It gets you those
> warnings at runtime and is a one-way trip for any given process.  But,
> it does detect in certain cases (fork() and unshare(FILES)) when it is
> safe to make the trip back to the "I'm checkpointable" state again.
"Checkpointable" is not even per-process property.
Imagine a set of SAs (struct xfrm_state) and SPDs (struct xfrm_policy).
They are a) per-netns, b) persistent.
You can hook into socketcalls to mark a process as uncheckpointable,
but since SAs and SPDs are persistent, the original process may have
already exited. You would have to walk every process in the same netns
as the SA adder and mark it as uncheckpointable. Definitely doable, but
ugly, isn't it?
Same for iptables rules.
"Checkpointable" is container property, OK?
* Re: Banning checkpoint (was: Re: What can OpenVZ do?)
       [not found]                                               ` <20090224044752.GB3202-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
@ 2009-02-24  5:11                                                 ` Dave Hansen
  2009-02-24 15:43                                                   ` Serge E. Hallyn
  2009-02-24 20:09                                                   ` Alexey Dobriyan
  0 siblings, 2 replies; 121+ messages in thread
From: Dave Hansen @ 2009-02-24  5:11 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Ingo Molnar, Nathan Lynch, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
On Tue, 2009-02-24 at 07:47 +0300, Alexey Dobriyan wrote:
> > I think what I posted is a decent compromise.  It gets you those
> > warnings at runtime and is a one-way trip for any given process.  But,
> > it does detect in certain cases (fork() and unshare(FILES)) when it is
> > safe to make the trip back to the "I'm checkpointable" state again.
> 
> "Checkpointable" is not even per-process property.
> 
> Imagine, set of SAs (struct xfrm_state) and SPDs (struct xfrm_policy).
> They are a) per-netns, b) persistent.
> 
> You can hook into socketcalls to mark process as uncheckpointable,
> but since SAs and SPDs are persistent, original process already exited.
> You're going to walk every process with same netns as SA adder and mark
> it as uncheckpointable. Definitely doable, but ugly, isn't it?
> 
> Same for iptable rules.
> 
> "Checkpointable" is container property, OK?
Ideally, I completely agree.
But, we don't currently have a concept of a true container in the
kernel.  Do you have any suggestions for any current objects that we
could use in its place for a while?
-- Dave
* Re: [RFC v13][PATCH 05/14] x86 support for checkpoint/restart
  2009-01-27 17:08 ` [RFC v13][PATCH 05/14] x86 support for checkpoint/restart Oren Laadan
@ 2009-02-24  7:47   ` Nathan Lynch
       [not found]     ` <20090224014739.1b82fc35-4v5LP+xe+1byhTdZtsIeww@public.gmane.org>
  2009-03-18  7:21     ` Oren Laadan
  0 siblings, 2 replies; 121+ messages in thread
From: Nathan Lynch @ 2009-02-24  7:47 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Andrew Morton, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linus Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar
Hi, this is an old thread I guess, but I just noticed some issues while
looking at this code.
On Tue, 27 Jan 2009 12:08:03 -0500
Oren Laadan <orenl@cs.columbia.edu> wrote:
> +static int cr_read_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
> +{
> +	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
> +	int ret;
> +
> +	ret = cr_kread(ctx, xstate_buf, xstate_size);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* i387 + MMX + SSE */
> +	preempt_disable();
> +
> +	/* init_fpu() also calls set_used_math() */
> +	ret = init_fpu(current);
> +	if (ret < 0)
> +		return ret;
Several problems here:
* init_fpu can call kmem_cache_alloc(GFP_KERNEL), but is called here
  with preempt disabled (init_fpu could use a might_sleep annotation?)
* if init_fpu returns an error, we get preempt imbalance
* if init_fpu returns an error, we "leak" the cr_hbuf_get for
  xstate_buf
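
A user-space sketch of how the three problems above could be avoided by
ordering the calls differently and unwinding through a single exit
label. Everything here is a hypothetical model: the counters stand in
for the preempt count and for outstanding cr_hbuf_get() buffers, and
init_fpu_model() stands in for init_fpu(); the read from the checkpoint
image is omitted. It is not the fix that was applied to the patch:

```c
#include <errno.h>
#include <stdlib.h>

static int preempt_depth;        /* models the preempt count */
static int bufs_outstanding;     /* models buffers not yet released */
static int init_fpu_should_fail; /* test knob for the error path */

static void preempt_disable_model(void) { preempt_depth++; }
static void preempt_enable_model(void)  { preempt_depth--; }

static void *buf_get(size_t n)
{
	void *p = malloc(n);

	if (p)
		bufs_outstanding++;
	return p;
}

static void buf_put(void *p)
{
	if (p) {
		bufs_outstanding--;
		free(p);
	}
}

/* May allocate (and so sleep); must not run with preemption disabled. */
static int init_fpu_model(void)
{
	return init_fpu_should_fail ? -ENOMEM : 0;
}

static int read_cpu_fpu_model(void)
{
	void *xstate_buf = buf_get(512);
	int ret;

	if (!xstate_buf)
		return -ENOMEM;

	/* The possibly-sleeping call comes before preemption is disabled. */
	ret = init_fpu_model();
	if (ret < 0)
		goto out;

	preempt_disable_model();
	/* ... the saved FPU state would be copied into place here ... */
	preempt_enable_model();
out:
	/* Single exit: the buffer is released on success and on error. */
	buf_put(xstate_buf);
	return ret;
}
```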
Speaking of cr_hbuf_get... I'd prefer to see that "allocator" go away
and its users converted to kmalloc/kfree (this is what I've done for
the powerpc C/R code, btw).
Using the slab allocator would:
* make the code less obscure and easier to review
* make the code more amenable to static analysis
* gain the benefits of slab debugging at runtime
But I think this has been pointed out before.  If I understand the
justification for cr_hbuf_get correctly, the allocations it services
are somehow known to be bounded in size and nesting.  But even if that
is the case, it's not much of a reason to avoid using kmalloc, is it?
* Re: Banning checkpoint (was: Re: What can OpenVZ do?)
  2009-02-24  5:11                                                 ` Dave Hansen
@ 2009-02-24 15:43                                                   ` Serge E. Hallyn
  2009-02-24 20:09                                                   ` Alexey Dobriyan
  1 sibling, 0 replies; 121+ messages in thread
From: Serge E. Hallyn @ 2009-02-24 15:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alexey Dobriyan, hpa, linux-api, containers, Nathan Lynch,
	linux-kernel, linux-mm, tglx, viro, mpm, Ingo Molnar, torvalds,
	Andrew Morton, xemul
Quoting Dave Hansen (dave@linux.vnet.ibm.com):
> On Tue, 2009-02-24 at 07:47 +0300, Alexey Dobriyan wrote:
> > > I think what I posted is a decent compromise.  It gets you those
> > > warnings at runtime and is a one-way trip for any given process.  But,
> > > it does detect in certain cases (fork() and unshare(FILES)) when it is
> > > safe to make the trip back to the "I'm checkpointable" state again.
> > 
> > "Checkpointable" is not even per-process property.
> > 
> > Imagine, set of SAs (struct xfrm_state) and SPDs (struct xfrm_policy).
> > They are a) per-netns, b) persistent.
> > 
> > You can hook into socketcalls to mark process as uncheckpointable,
> > but since SAs and SPDs are persistent, original process already exited.
> > You're going to walk every process with same netns as SA adder and mark
> > it as uncheckpointable. Definitely doable, but ugly, isn't it?
> > 
> > Same for iptable rules.
> > 
> > "Checkpointable" is container property, OK?
> 
> Ideally, I completely agree.
> 
> But, we don't currently have a concept of a true container in the
> kernel.  Do you have any suggestions for any current objects that we
> could use in its place for a while?
I think the main point is that it makes the concept of marking a task as
uncheckpointable unworkable.  So at sys_checkpoint() time or when we cat
/proc/$$/checkpointable, we can check for all of the uncheckpointable
state of both $$ and its container (including whether $$ is a container
init).  But we can't expect that (to use Alexey's example) when one task
in a netns does a certain sys_socketcall, all tasks in the container
will be marked uncheckpointable.  Or at least we don't want to.
Which means task->uncheckpointable can't be the big stick which I think
you were hoping it would be.
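
A small user-space sketch of why a per-task flag does not fit Alexey's
example. The types netns_model and task_model and all three helpers are
hypothetical. The state that blocks checkpointing lives in the shared
network namespace and outlives the task that created it, so the check
is made against the namespace at checkpoint time rather than against a
flag on whichever task made the call:

```c
#include <stdbool.h>

/* Hypothetical model of per-netns, persistent IPsec state. */
struct netns_model {
	int nr_xfrm_states;    /* SAs  */
	int nr_xfrm_policies;  /* SPDs */
};

struct task_model {
	struct netns_model *net;  /* shared by every task in the netns */
	bool alive;
};

/* Adding an SA changes the namespace, not the calling task. */
static void add_sa(struct task_model *t)
{
	t->net->nr_xfrm_states++;
}

/* The task goes away; the SA it added stays in the namespace. */
static void task_exit(struct task_model *t)
{
	t->alive = false;
}

/* Checkpoint-time check on the shared object itself. */
static bool netns_is_checkpointable(const struct netns_model *net)
{
	return net->nr_xfrm_states == 0 && net->nr_xfrm_policies == 0;
}
```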
-serge
* Re: [RFC v13][PATCH 05/14] x86 support for checkpoint/restart
       [not found]     ` <20090224014739.1b82fc35-4v5LP+xe+1byhTdZtsIeww@public.gmane.org>
@ 2009-02-24 16:06       ` Dave Hansen
  0 siblings, 0 replies; 121+ messages in thread
From: Dave Hansen @ 2009-02-24 16:06 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Oren Laadan, Andrew Morton, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linus Torvalds, Alexander Viro,
	H. Peter Anvin, Thomas Gleixner, Ingo Molnar
On Tue, 2009-02-24 at 01:47 -0600, Nathan Lynch wrote:
> But I think this has been pointed out before.  If I understand the
> justification for cr_hbuf_get correctly, the allocations it services
> are somehow known to be bounded in size and nesting.  But even if that
> is the case, it's not much of a reason to avoid using kmalloc, is it?
Oren wants this particular facility to be used for live migration.  To
support good live migration, we need to be able to return from the
syscall as fast as possible.  To do that, Oren proposed that we buffer
all the data needed for the checkpoint inside the kernel.
The current cr_hbuf_put/get() could easily be modified to support this
usage by basically making put() do nothing, then handing off a handle to
the cr_ctx structure elsewhere in the kernel.  When the time comes to
free up the in-memory image, you only have one simple structure to go
free (the hbuf) as opposed to a bunch of little kmalloc()'d objects.
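
A user-space sketch of the buffering scheme described above, written as
a simple bump allocator. The names hbuf_model, hbuf_init(), hbuf_get(),
hbuf_put() and hbuf_free() are hypothetical and this is not the
cr_hbuf_get()/cr_hbuf_put() code from the patch set. It only shows the
property being argued for: put() does nothing, and the whole in-memory
image is released with one free:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct hbuf_model {
	char  *base;  /* one contiguous region holding every record */
	size_t size;  /* total capacity in bytes */
	size_t pos;   /* offset of the next free byte */
};

static int hbuf_init(struct hbuf_model *h, size_t size)
{
	h->base = malloc(size);
	h->size = h->base ? size : 0;
	h->pos = 0;
	return h->base ? 0 : -1;
}

/* Hand out the next chunk, rounded up to keep 8-byte alignment. */
static void *hbuf_get(struct hbuf_model *h, size_t n)
{
	void *p;

	if (n > SIZE_MAX - 7)
		return NULL;
	n = (n + 7) & ~(size_t)7;
	if (n > h->size - h->pos)
		return NULL;
	p = h->base + h->pos;
	h->pos += n;
	return p;
}

/* Deliberately a no-op: records stay buffered until the image is freed. */
static void hbuf_put(struct hbuf_model *h, size_t n)
{
	(void)h;
	(void)n;
}

/* One call releases every record at once. */
static void hbuf_free(struct hbuf_model *h)
{
	free(h->base);
	h->base = NULL;
	h->size = 0;
	h->pos = 0;
}
```

The trade-off raised in the thread still applies: individual records
cannot be checked by slab debugging or tracked by static analysis the
way separate kmalloc()/kfree() pairs can.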
I'm sure I'm missing something.  I'm also sure that this *will* work
eventually.  But, I don't think the code as it stands supports keeping
the abstraction in there.  It is virtually impossible to debate the
design or its alternatives in this state.
-- Dave
* Re: Banning checkpoint (was: Re: What can OpenVZ do?)
  2009-02-24  5:11                                                 ` Dave Hansen
  2009-02-24 15:43                                                   ` Serge E. Hallyn
@ 2009-02-24 20:09                                                   ` Alexey Dobriyan
  1 sibling, 0 replies; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-24 20:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Nathan Lynch, linux-api, containers, mpm,
	linux-kernel, linux-mm, viro, hpa, Andrew Morton, torvalds, tglx,
	xemul
On Mon, Feb 23, 2009 at 09:11:25PM -0800, Dave Hansen wrote:
> On Tue, 2009-02-24 at 07:47 +0300, Alexey Dobriyan wrote:
> > > I think what I posted is a decent compromise.  It gets you those
> > > warnings at runtime and is a one-way trip for any given process.  But,
> > > it does detect in certain cases (fork() and unshare(FILES)) when it is
> > > safe to make the trip back to the "I'm checkpointable" state again.
> > 
> > "Checkpointable" is not even per-process property.
> > 
> > Imagine, set of SAs (struct xfrm_state) and SPDs (struct xfrm_policy).
> > They are a) per-netns, b) persistent.
> > 
> > You can hook into socketcalls to mark process as uncheckpointable,
> > but since SAs and SPDs are persistent, original process already exited.
> > You're going to walk every process with same netns as SA adder and mark
> > it as uncheckpointable. Definitely doable, but ugly, isn't it?
> > 
> > Same for iptable rules.
> > 
> > "Checkpointable" is container property, OK?
> 
> Ideally, I completely agree.
> 
> But, we don't currently have a concept of a true container in the
> kernel.  Do you have any suggestions for any current objects that we
> could use in its place for a while?
After all the foo_ns changes, struct nsproxy is such a thing.
More specifically, a process with a fully cloned nsproxy acting as init,
plus all its children. In terms of data structures: every task_struct in
such a tree, every nsproxy of them, every foo_ns, and so on down to
lower levels.
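
A minimal user-space sketch of the "init plus all its children"
definition above. The types nsproxy_model and task_model and the
function task_in_container() are hypothetical. Membership is decided
purely by ancestry from the container's init task; a real
implementation would also have to deal with locking and reparenting,
which this model ignores:

```c
#include <stdbool.h>
#include <stddef.h>

struct nsproxy_model { int id; };

struct task_model {
	struct task_model    *parent;   /* NULL for the host's init */
	struct nsproxy_model *nsproxy;  /* namespaces this task sees */
};

/* True if t is the container's init or one of its descendants. */
static bool task_in_container(const struct task_model *t,
			      const struct task_model *container_init)
{
	for (; t; t = t->parent)
		if (t == container_init)
			return true;
	return false;
}
```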
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-12 23:04               ` How much of a mess does OpenVZ make? ;) Was: " Dave Hansen
@ 2009-02-26 15:57                 ` Alexey Dobriyan
  2009-03-10 21:53                   ` Alexey Dobriyan
  2009-02-26 16:27                 ` Alexey Dobriyan
  1 sibling, 1 reply; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-26 15:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-X9Un+BFzKDI,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
On Thu, Feb 12, 2009 at 03:04:05PM -0800, Dave Hansen wrote:
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... kernel/cpt/ | diffstat 
>  Makefile        |   53 +
>  cpt_conntrack.c |  365 ++++++++++++
>  cpt_context.c   |  257 ++++++++
>  cpt_context.h   |  215 +++++++
>  cpt_dump.c      | 1250 ++++++++++++++++++++++++++++++++++++++++++
>  cpt_dump.h      |   16 
>  cpt_epoll.c     |  113 +++
>  cpt_exports.c   |   13 
>  cpt_files.c     | 1626 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  cpt_files.h     |   71 ++
>  cpt_fsmagic.h   |   16 
>  cpt_inotify.c   |  144 ++++
>  cpt_kernel.c    |  177 ++++++
>  cpt_kernel.h    |   99 +++
>  cpt_mm.c        |  923 +++++++++++++++++++++++++++++++
>  cpt_mm.h        |   35 +
>  cpt_net.c       |  614 ++++++++++++++++++++
>  cpt_net.h       |    7 
>  cpt_obj.c       |  162 +++++
>  cpt_obj.h       |   62 ++
>  cpt_proc.c      |  595 ++++++++++++++++++++
>  cpt_process.c   | 1369 ++++++++++++++++++++++++++++++++++++++++++++++
>  cpt_process.h   |   13 
>  cpt_socket.c    |  790 ++++++++++++++++++++++++++
>  cpt_socket.h    |   33 +
>  cpt_socket_in.c |  450 +++++++++++++++
>  cpt_syscalls.h  |  101 +++
>  cpt_sysvipc.c   |  403 +++++++++++++
>  cpt_tty.c       |  215 +++++++
>  cpt_ubc.c       |  132 ++++
>  cpt_ubc.h       |   23 
>  cpt_x8664.S     |   67 ++
>  rst_conntrack.c |  283 +++++++++
>  rst_context.c   |  323 ++++++++++
>  rst_epoll.c     |  169 +++++
>  rst_files.c     | 1648 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  rst_inotify.c   |  196 ++++++
>  rst_mm.c        | 1151 +++++++++++++++++++++++++++++++++++++++
>  rst_net.c       |  741 +++++++++++++++++++++++++
>  rst_proc.c      |  580 +++++++++++++++++++
>  rst_process.c   | 1640 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  rst_socket.c    |  918 +++++++++++++++++++++++++++++++
>  rst_socket_in.c |  489 ++++++++++++++++
>  rst_sysvipc.c   |  633 +++++++++++++++++++++
>  rst_tty.c       |  384 +++++++++++++
>  rst_ubc.c       |  131 ++++
>  rst_undump.c    | 1007 ++++++++++++++++++++++++++++++++++
>  47 files changed, 20702 insertions(+)
> 
> One important thing that leaves out is the interaction that this code
> has with the rest of the kernel.  That's critically important when
> considering long-term maintenance, and I'd be curious how the OpenVZ
> folks view it. 
OpenVZ as-is in some cases wants some functions to be made global
(and, if the C/R code is to be modular, exported), or possibly several
iterators added.
But it's a negligible amount of changes compared to the main code.
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-12 23:04               ` How much of a mess does OpenVZ make? ;) Was: " Dave Hansen
  2009-02-26 15:57                 ` Alexey Dobriyan
@ 2009-02-26 16:27                 ` Alexey Dobriyan
  2009-02-26 17:33                   ` Ingo Molnar
  1 sibling, 1 reply; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-26 16:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Hansen, mpm, containers, hpa, linux-kernel, linux-mm, viro,
	linux-api, mingo, torvalds, tglx, xemul
Regarding interactions of C/R with other code:
1. trivia
1a. field in some datastructure is removed
	technically, compilation breaks
	Need to decide what to do -- from trivial compile fix
	by removing code to ignoring some fields in dump image.
1b. field is added
	This is likely to happen silently, so maintainers
	will have to keep an eye on critical data structures
	and general big changes in core kernel.
	Need to decide what to do with new field --
	anything from 'doesn't matter' to 'yeah, needs C/R part'
	with dump format change.
2. non-trivia
2a. standalone subsystem added (say, network protocol)
	If submitter sends C/R part -- excellent.
	If he doesn't, well, don't forget to add tiny bit of check
	and abort if said subsystem is in use.
2b. massacre inside some subsystem (say, struct cred introduction)
	Likely, C/R non-trivially breaks both in compilation and
	in working, requires non-trivial changes in algorithms and in
	C/R dump image.
For some very core data structures dump file images should be made
fatter than needed to be more future-proof, like
a) statistics in u64 regardless of in-kernel width.
b) ->vm_flags in image should be at least u64 and bits made append-only
	so dump format would survive flags addition, removal and
	renumbering.
and so on.
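
A sketch of point b) above: translating in-kernel flag values to a
fixed, append-only set of 64-bit image bits. All names and values here
(K_VM_*, IMG_VM_*, vm_flag_map, vm_flags_to_image(),
vm_flags_from_image()) are hypothetical and chosen only to show that
the two numberings are independent. The in-kernel values are
deliberately different from the image bits:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-kernel values; free to be renumbered or removed. */
#define K_VM_WRITE 0x1UL
#define K_VM_EXEC  0x2UL
#define K_VM_READ  0x4UL

/* Image bits: append-only, never renumbered or reused. */
#define IMG_VM_READ  (UINT64_C(1) << 0)
#define IMG_VM_WRITE (UINT64_C(1) << 1)
#define IMG_VM_EXEC  (UINT64_C(1) << 2)

struct flag_map {
	unsigned long kernel;
	uint64_t      image;
};

static const struct flag_map vm_flag_map[] = {
	{ K_VM_READ,  IMG_VM_READ  },
	{ K_VM_WRITE, IMG_VM_WRITE },
	{ K_VM_EXEC,  IMG_VM_EXEC  },
};

#define NR_VM_FLAGS (sizeof(vm_flag_map) / sizeof(vm_flag_map[0]))

/* Checkpoint side: kernel numbering -> stable image numbering. */
static uint64_t vm_flags_to_image(unsigned long kflags)
{
	uint64_t img = 0;
	size_t i;

	for (i = 0; i < NR_VM_FLAGS; i++)
		if (kflags & vm_flag_map[i].kernel)
			img |= vm_flag_map[i].image;
	return img;
}

/* Restart side: refuse images carrying bits this kernel does not know. */
static int vm_flags_from_image(uint64_t img, unsigned long *kflags)
{
	unsigned long k = 0;
	size_t i;

	for (i = 0; i < NR_VM_FLAGS; i++) {
		if (img & vm_flag_map[i].image) {
			k |= vm_flag_map[i].kernel;
			img &= ~vm_flag_map[i].image;
		}
	}
	if (img)
		return -1;  /* unknown image bit left over */
	*kflags = k;
	return 0;
}
```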
So I guess that at first the C/R maintainers will take care of all of
these issues, with the default policy being 'return -E, implement C/R
later'. Ideally, though, C/R will have the same rights as any other
kernel subsystem, so people will make non-trivial changes in C/R as they
make their own non-trivial changes.
If the last statement isn't acceptable, in-kernel C/R is likely doomed
from the start (especially given the lack of an in-kernel testsuite).
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-26 16:27                 ` Alexey Dobriyan
@ 2009-02-26 17:33                   ` Ingo Molnar
       [not found]                     ` <20090226173302.GB29439-X9Un+BFzKDI@public.gmane.org>
  0 siblings, 1 reply; 121+ messages in thread
From: Ingo Molnar @ 2009-02-26 17:33 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Andrew Morton, Dave Hansen, mpm, containers, hpa, linux-kernel,
	linux-mm, viro, linux-api, torvalds, tglx, xemul
* Alexey Dobriyan <adobriyan@gmail.com> wrote:
> Regarding interactions of C/R with other code:
> 
> 1. trivia
> 1a. field in some datastructure is removed
> 
> 	technically, compilation breaks
> 
> 	Need to decide what to do -- from trivial compile fix
> 	by removing code to ignoring some fields in dump image.
> 
> 1b. field is added
> 
> 	This is likely to happen silently, so maintainers
> 	will have to keep an eye on critical data structures
> 	and general big changes in core kernel.
> 
> 	Need to decide what to do with new field --
> 	anything from 'doesn't matter' to 'yeah, needs C/R part'
> 	with dump format change.
> 
> 2. non-trivia
> 2a. standalone subsystem added (say, network protocol)
> 
>     If submitter sends C/R part -- excellent.
>     If he doesn't, well, don't forget to add tiny bit of check
> 	and abort if said subsystem is in use.
> 
> 2b. massacre inside some subsystem (say, struct cred introduction)
> 
> 	Likely, C/R non-trivially breaks both in compilation and
> 	in working, requires non-trivial changes in algorithms and in
> 	C/R dump image.
> 
> For some very core data structures dump file images should be made
> fatter than needed to be more future-proof, like
> a) statistics in u64 regardless of in-kernel width.
> b) ->vm_flags in image should be at least u64 and bits made append-only
> 	so dump format would survive flags addition, removal and
> 	renumbering.
> and so on.
> 
> 
> 
> So I guess, at first C/R maintainers will take care of all of 
> these issues with default policy being 'return -E, implement 
> C/R later', but, ideally, C/R will have same rights as other 
> kernel subsystem, so people will make non-trivial changes in 
> C/R as they make their own non-trivial changes.
> 
> If last statement isn't acceptable, in-kernel C/R is likely 
> doomed from the start (especially given lack of in-kernel 
> testsuite).
Well, given the fact that OpenVZ has followed such upstream 
changes for years successfully, there's precedent that it's 
possible to do it and stay sane.
If C/R is bitrotting will it be blamed on the maintainer who 
broke it, or on C/R maintainers? Do we have a good, fast and 
thin vector along which we can quickly tag Kconfig spaces (or 
even runtime flags) that are known (or discovered) to be C/R 
unsafe?
Is there any automated test that could discover C/R breakage via 
brute force? All that matters in such cases is to get the "you 
broke stuff" information as soon as possible. If it comes at an 
early stage developers can generally just fix stuff. If it comes 
in late, close to some release, people become more argumentative 
and might attack C/R instead of fixing the code.
I think the main question is: will we ever find ourselves in the 
future saying that "C/R sucks, nobody but a small minority uses 
it, wish we had never merged it"? I think the likelihood of that
is very low. I think the current OpenVZ stuff already looks very
useful, and I don't think we've realized (let alone explored) all
the possibilities yet.
	Ingo
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                     ` <20090226173302.GB29439-X9Un+BFzKDI@public.gmane.org>
@ 2009-02-26 18:30                       ` Greg Kurz
  2009-02-26 22:17                         ` Alexey Dobriyan
  2009-02-26 22:31                       ` Alexey Dobriyan
  1 sibling, 1 reply; 121+ messages in thread
From: Greg Kurz @ 2009-02-26 18:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexey Dobriyan, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
On Thu, 2009-02-26 at 18:33 +0100, Ingo Molnar wrote:
> I think the main question is: will we ever find ourselves in the 
> future saying that "C/R sucks, nobody but a small minority uses 
> it, wish we had never merged it"? I think the likelihood of that 
> is very low. I think the current OpenVZ stuff already looks very 
We've been maintaining for some years now a C/R middleware with only a
few hooks in the kernel. Our strategy is to leverage existing kernel
paths as they do most of the work right.
Most of the checkpoint is performed from userspace, using regular
syscalls in a signal handler or /proc parsing. Restart is a bit trickier
and needs some kernel support to bypass syscall checks and enforce a
specific id for a resource. At the end, we support C/R and live
migration of networking apps (websphere application server for example).
From our experience, we can tell:
Pros: mostly not-so-tricky userland code, independent from kernel
internals
Cons: sub-optimal for some resources
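
To illustrate the /proc-parsing side of a userspace checkpoint, here is a
minimal sketch. It is not taken from the middleware described above; it
only assumes the usual /proc/<pid>/maps line layout ("start-end perms ...").
struct map_entry and read_maps() are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* One line of a maps-format file: address range and permission string. */
struct map_entry {
	unsigned long start;
	unsigned long end;
	char perms[5];
};

/*
 * Parse up to max entries from a file in /proc/<pid>/maps format.
 * Returns the number of entries read, or -1 if the file can't be opened.
 */
int read_maps(const char *path, struct map_entry *out, int max)
{
	char line[512];
	int n = 0;
	FILE *f;

	assert(out && max > 0);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (n < max && fgets(line, sizeof(line), f)) {
		/* Lines that don't start with "hex-hex perms" are skipped. */
		if (sscanf(line, "%lx-%lx %4s",
			   &out[n].start, &out[n].end, out[n].perms) == 3)
			n++;
	}
	fclose(f);
	return n;
}
```

A checkpointer would call this on /proc/<pid>/maps and then save the
contents of each range; that part is where kernel help starts to matter.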
-- 
Gregory Kurz                                     gkurz-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org
Software Engineer @ IBM/Meiosys                  http://www.ibm.com
Tel +33 (0)534 638 479                           Fax +33 (0)561 400 420
"Anarchy is about taking complete responsibility for yourself."
        Alan Moore.
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-26 18:30                       ` Greg Kurz
@ 2009-02-26 22:17                         ` Alexey Dobriyan
       [not found]                           ` <20090226221709.GA2924-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
  2009-02-27  9:36                           ` Cedric Le Goater
  0 siblings, 2 replies; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-26 22:17 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Ingo Molnar, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
On Thu, Feb 26, 2009 at 07:30:16PM +0100, Greg Kurz wrote:
> On Thu, 2009-02-26 at 18:33 +0100, Ingo Molnar wrote:
> > I think the main question is: will we ever find ourselves in the 
> > future saying that "C/R sucks, nobody but a small minority uses 
> > it, wish we had never merged it"? I think the likelihood of that 
> > is very low. I think the current OpenVZ stuff already looks very 
> 
> We've been maintaining for some years now a C/R middleware with only a
> few hooks in the kernel. Our strategy is to leverage existing kernel
> paths as they do most of the work right.
> 
> Most of the checkpoint is performed from userspace, using regular
> syscalls in a signal handler or /proc parsing. Restart is a bit trickier
> and needs some kernel support to bypass syscall checks and enforce a
> specific id for a resource. At the end, we support C/R and live
> migration of networking apps (websphere application server for example).
> 
> From our experience, we can tell:
> 
> Pros: mostly not-so-tricky userland code, independent from kernel
> internals
> Cons: sub-optimal for some resources
How do you restore struct task_struct::did_exec ?
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                     ` <20090226173302.GB29439-X9Un+BFzKDI@public.gmane.org>
  2009-02-26 18:30                       ` Greg Kurz
@ 2009-02-26 22:31                       ` Alexey Dobriyan
  2009-02-27  9:03                         ` Ingo Molnar
                                           ` (2 more replies)
  1 sibling, 3 replies; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-26 22:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Dave Hansen, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
On Thu, Feb 26, 2009 at 06:33:02PM +0100, Ingo Molnar wrote:
> 
> * Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
> > Regarding interactions of C/R with other code:
> > 
> > 1. trivia
> > 1a. field in some datastructure is removed
> > 
> > 	technically, compilation breaks
> > 
> > 	Need to decide what to do -- from trivial compile fix
> > 	by removing code to ignoring some fields in dump image.
> > 
> > 1b. field is added
> > 
> > 	This is likely to happen silently, so maintainers
> > 	will have to keep an eye on critical data structures
> > 	and general big changes in core kernel.
> > 
> > 	Need to decide what to do with new field --
> > 	anything from 'doesn't matter' to 'yeah, needs C/R part'
> > 	with dump format change.
> > 
> > 2. non-trivia
> > 2a. standalone subsystem added (say, network protocol)
> > 
> >     If submitter sends C/R part -- excellent.
> >     If he doesn't, well, don't forget to add tiny bit of check
> > 	and abort if said subsystem is in use.
> > 
> > 2b. massacre inside some subsystem (say, struct cred introduction)
> > 
> > 	Likely, C/R non-trivially breaks both in compilation and
> > 	in working, requires non-trivial changes in algorithms and in
> > 	C/R dump image.
> > 
> > For some very core data structures dump file images should be made
> > fatter than needed to be more future-proof, like
> > a) statistics in u64 regardless of in-kernel width.
> > b) ->vm_flags in image should be at least u64 and bits made append-only
> > 	so dump format would survive flags addition, removal and
> > 	renumbering.
> > and so on.
> > 
> > 
> > 
> > So I guess, at first C/R maintainers will take care of all of 
> > these issues with default policy being 'return -E, implement 
> > C/R later', but, ideally, C/R will have same rights as other 
> > kernel subsystem, so people will make non-trivial changes in 
> > C/R as they make their own non-trivial changes.
> > 
> > If last statement isn't acceptable, in-kernel C/R is likely 
> > doomed from the start (especially given lack of in-kernel 
> > testsuite).
> 
> Well, given the fact that OpenVZ has followed such upstream 
> changes for years successfully, there's precedent that it's 
> possible to do it and stay sane.
> 
> If C/R is bitrotting will it be blamed on the maintainer who 
> broke it, or on C/R maintainers?
Eventually, I hope, on the patch submitter. In reality, people will have
little intuition for C/R, so telling them to fix it is not right.
> Do we have a good, fast and thin vector along which we can quickly
> tag Kconfig spaces (or even runtime flags) that are known
> (or discovered) to be C/R unsafe?
Good -- yes, fast -- yes, Kconfig -- no, because a config option being
turned on doesn't mean the application uses it.
See cr_dump_cred(), cr_check_cred(), cr_check_* for what is easy to do
to prevent C/R and invisible breakage.
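
The refusal policy can be shown outside the kernel with a whitelist mask,
in the spirit of cr_check_vma() in the patch below. This is only a sketch:
the OBJ_* bits and check_flags() are made up for illustration and are not
real kernel flags.

```c
#include <errno.h>

/* Hypothetical feature bits of an object; not real kernel flags. */
#define OBJ_READ	(1UL << 0)
#define OBJ_WRITE	(1UL << 1)
#define OBJ_EXEC	(1UL << 2)
#define OBJ_SPECIAL	(1UL << 3)	/* pretend C/R can't handle this yet */

/* Bits the (hypothetical) dump code knows how to save and restore. */
#define OBJ_SUPPORTED	(OBJ_READ | OBJ_WRITE | OBJ_EXEC)

/* Return 0 if the object can be checkpointed, -EINVAL otherwise. */
int check_flags(unsigned long flags)
{
	/* Anything outside the whitelist, known or not, refuses C/R. */
	if (flags & ~OBJ_SUPPORTED)
		return -EINVAL;
	return 0;
}
```

Because the test is "anything not whitelisted", a bit added later by
somebody else makes checkpoint fail loudly instead of producing a silently
incomplete image.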
See the check in cr_collect_mm(), where refcounts are compared to refuse
C/R in cases where the root cause is unknown.
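
Stripped of kernel details, the refcount comparison amounts to: count how
many references to each object were found inside the container, then
compare with the object's own refcount. A userspace sketch follows; struct
shared, struct seen and check_external_refs() are hypothetical and only
mirror the idea, not the patch's cr_object machinery.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical shared object with a reference count. */
struct shared {
	int refcount;
};

/* How many references to one object were found among the tasks. */
struct seen {
	struct shared *obj;
	int count;
};

/*
 * refs[] holds one pointer per reference found inside the container.
 * Returns 0 if every reference to every object is accounted for,
 * -EINVAL if some object is also referenced from outside,
 * -ENOMEM if the table is too small.
 */
int check_external_refs(struct shared **refs, size_t nrefs,
			struct seen *table, size_t max)
{
	size_t n = 0, i, j;

	for (i = 0; i < nrefs; i++) {
		for (j = 0; j < n; j++)
			if (table[j].obj == refs[i])
				break;
		if (j == n) {
			/* First time we see this object: add an entry. */
			if (n == max)
				return -ENOMEM;
			table[n].obj = refs[i];
			table[n].count = 0;
			n++;
		}
		table[j].count++;
	}
	for (j = 0; j < n; j++)
		if (table[j].count != table[j].obj->refcount)
			return -EINVAL;
	return 0;
}
```

A mismatch says only that somebody else holds a reference, not who, which
is why it works as a catch-all when the root cause is unknown.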
> Is there any automated test that could discover C/R breakage via 
> brute force?
So far I'm relying on BUILD_BUG_ON(), but I probably don't understand
what you're asking.
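
For readers unfamiliar with it: BUILD_BUG_ON() breaks the build when its
condition is true, so a size change in a dumped structure is noticed at
compile time (cr_dump_cred() in the patch below uses it this way). A rough
userspace analogue using C11 _Static_assert, with hypothetical struct caps,
struct image_caps and dump_caps():

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory type whose size the image format depends on. */
struct caps {
	uint32_t cap[2];
};

/* Hypothetical image field: always 64 bits wide. */
struct image_caps {
	uint64_t cr_cap;
};

/* Compile-time check: the build fails if struct caps changes size. */
_Static_assert(sizeof(struct caps) == 8,
	       "struct caps changed size: update the image format");

/* Copy the in-memory value into the fixed-width image field. */
void dump_caps(struct image_caps *img, const struct caps *c)
{
	memcpy(&img->cr_cap, c, sizeof(img->cr_cap));
}
```

This catches layout changes only; it says nothing about whether restart
still works at runtime, which is the part an automated test would cover.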
> All that matters in such cases is to get the "you broke stuff"
> information as soon as possible. If it comes at an early stage
> developers can generally just fix stuff. If it comes in late,
> close to some release, people become more argumentative and might
> attack C/R instead of fixing the code.
I hope for 'make test' but this is unrealistic right now
(read: lack of manpower :-)
> I think the main question is: will we ever find ourselves in the 
> future saying that "C/R sucks, nobody but a small minority uses 
> it, wish we had never merged it"? I think the likelihood of that 
> is very low. I think the current OpenVZ stuff already looks very 
> useful, and I don't think we've realized (let alone explored) all 
> the possibilities yet.
This is the collecting part and the start of the dumping part of the
cleaned-up OpenVZ C/R implementation, FYI.
 arch/x86/include/asm/unistd_32.h   |    2 
 arch/x86/kernel/syscall_table_32.S |    2 
 include/linux/Kbuild               |    1 
 include/linux/cr.h                 |   56 ++++++
 include/linux/ipc_namespace.h      |    3 
 include/linux/syscalls.h           |    5 
 init/Kconfig                       |    2 
 kernel/Makefile                    |    1 
 kernel/cr/Kconfig                  |   11 +
 kernel/cr/Makefile                 |    8 
 kernel/cr/cpt-cred.c               |  115 +++++++++++++
 kernel/cr/cpt-fs.c                 |  122 +++++++++++++
 kernel/cr/cpt-mm.c                 |  134 +++++++++++++++
 kernel/cr/cpt-ns.c                 |  324 +++++++++++++++++++++++++++++++++++++
 kernel/cr/cpt-signal.c             |  121 +++++++++++++
 kernel/cr/cpt-sys.c                |  228 ++++++++++++++++++++++++++
 kernel/cr/cr-ctx.c                 |  141 ++++++++++++++++
 kernel/cr/cr.h                     |   61 ++++++
 kernel/cr/rst-sys.c                |    9 +
 kernel/sys_ni.c                    |    3 
 20 files changed, 1349 insertions(+)
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..9504ede 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,8 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_checkpoint		333
+#define __NR_restart		334
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index e2e86a0..9f8c398 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index b97cdc5..113d257 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -50,6 +50,7 @@ header-y += coff.h
 header-y += comstats.h
 header-y += const.h
 header-y += cgroupstats.h
+header-y += cr.h
 header-y += cramfs_fs.h
 header-y += cycx_cfm.h
 header-y += dlmconstants.h
diff --git a/include/linux/cr.h b/include/linux/cr.h
new file mode 100644
index 0000000..33fddd9
--- /dev/null
+++ b/include/linux/cr.h
@@ -0,0 +1,56 @@
+#ifndef __INCLUDE_LINUX_CR_H
+#define __INCLUDE_LINUX_CR_H
+
+#include <linux/types.h>
+
+struct cr_header {
+	/* Immutable part except version bumps. */
+#define CR_HEADER_MAGIC	"LinuxC/R"
+	__u8	cr_signature[8];
+#define CR_IMAGE_VERSION	1
+	__le64	cr_image_version;
+
+	/* Mutable part. */
+	__u8	cr_uts_release[64];	/* Give distro kernels a chance. */
+#define CR_ARCH_X86_32	1
+	__le32	cr_arch;
+};
+
+struct cr_object_header {
+#define CR_OBJ_UTS_NS	1
+#define CR_OBJ_CRED	2
+	__u32	cr_type;	/* object type */
+	__u32	cr_len;		/* object length in bytes including header */
+};
+
+#define cr_type	cr_hdr.cr_type
+#define cr_len	cr_hdr.cr_len
+
+struct cr_image_uts_ns {
+	struct cr_object_header cr_hdr;
+
+	__u8	cr_sysname[64];
+	__u8	cr_nodename[64];
+	__u8	cr_release[64];
+	__u8	cr_version[64];
+	__u8	cr_machine[64];
+	__u8	cr_domainname[64];
+};
+
+struct cr_image_cred {
+	struct cr_object_header cr_hdr;
+
+	__u32	cr_uid;
+	__u32	cr_gid;
+	__u32	cr_suid;
+	__u32	cr_sgid;
+	__u32	cr_euid;
+	__u32	cr_egid;
+	__u32	cr_fsuid;
+	__u32	cr_fsgid;
+	__u64	cr_cap_inheritable;
+	__u64	cr_cap_permitted;
+	__u64	cr_cap_effective;
+	__u64	cr_cap_bset;
+};
+#endif
diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index ea330f9..87a8053 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -3,9 +3,12 @@
 
 #include <linux/err.h>
 #include <linux/idr.h>
+#include <linux/kref.h>
 #include <linux/rwsem.h>
 #include <linux/notifier.h>
 
+struct kern_ipc_perm;
+
 /*
  * ipc namespace events
  */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f9f900c..fac8fa9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,11 @@ asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
 asmlinkage long sys_pipe2(int __user *, int);
 asmlinkage long sys_pipe(int __user *);
 
+#ifdef CONFIG_CR
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(int fd, unsigned long flags);
+#endif
+
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index f068071..1b69c64 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -540,6 +540,8 @@ config CGROUP_MEM_RES_CTLR_SWAP
 
 endif # CGROUPS
 
+source "kernel/cr/Kconfig"
+
 config MM_OWNER
 	bool
 
diff --git a/kernel/Makefile b/kernel/Makefile
index e4791b3..71f9c68 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -93,6 +93,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace/
 obj-$(CONFIG_TRACING) += trace/
 obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_CR) += cr/
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan-HpYC8cTCicIJY7gZg3T8ig@public.gmane.org>, the -fno-omit-frame-pointer is
diff --git a/kernel/cr/Kconfig b/kernel/cr/Kconfig
new file mode 100644
index 0000000..bebef29
--- /dev/null
+++ b/kernel/cr/Kconfig
@@ -0,0 +1,11 @@
+config CR
+	bool "Container checkpoint/restart"
+	depends on IPC_NS || (SYSVIPC = n)
+	depends on NET_NS || (NET = n)
+	depends on PID_NS
+	depends on USER_NS
+	depends on UTS_NS
+	select FREEZER
+	depends on X86_32
+	help
+	  Container checkpoint/restart
diff --git a/kernel/cr/Makefile b/kernel/cr/Makefile
new file mode 100644
index 0000000..dc3dd49
--- /dev/null
+++ b/kernel/cr/Makefile
@@ -0,0 +1,8 @@
+obj-$(CONFIG_CR) += cr.o
+cr-y := cr-ctx.o
+cr-y += cpt-sys.o rst-sys.o
+cr-y += cpt-cred.o
+cr-y += cpt-fs.o
+cr-y += cpt-mm.o
+cr-y += cpt-ns.o
+cr-y += cpt-signal.o
diff --git a/kernel/cr/cpt-cred.c b/kernel/cr/cpt-cred.c
new file mode 100644
index 0000000..cdd1036
--- /dev/null
+++ b/kernel/cr/cpt-cred.c
@@ -0,0 +1,115 @@
+#include <linux/cr.h>
+#include <linux/cred.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include "cr.h"
+
+int cr_dump_cred(struct cr_context *ctx, struct cred *cred)
+{
+	struct cr_image_cred *i;
+
+	printk("%s: dump cred %p\n", __func__, cred);
+
+	i = kzalloc(sizeof(*i), GFP_KERNEL);
+	if (!i)
+		return -ENOMEM;
+	i->cr_type = CR_OBJ_CRED;
+	i->cr_len = sizeof(*i);
+
+	i->cr_uid = cred->uid;
+	i->cr_gid = cred->gid;
+	i->cr_suid = cred->suid;
+	i->cr_sgid = cred->sgid;
+	i->cr_euid = cred->euid;
+	i->cr_egid = cred->egid;
+	i->cr_fsuid = cred->fsuid;
+	i->cr_fsgid = cred->fsgid;
+	BUILD_BUG_ON(sizeof(cred->cap_inheritable) != 8);
+	memcpy(&i->cr_cap_inheritable, &cred->cap_inheritable, 8);
+	memcpy(&i->cr_cap_permitted, &cred->cap_permitted, 8);
+	memcpy(&i->cr_cap_effective, &cred->cap_effective, 8);
+	memcpy(&i->cr_cap_bset, &cred->cap_bset, 8);
+
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+
+	kfree(i);
+	return 0;
+}
+
+static int cr_check_cred(struct cred *cred)
+{
+	if (cred->securebits)
+		return -EINVAL;
+#ifdef CONFIG_KEYS
+	if (cred->thread_keyring || cred->request_key_auth || cred->tgcred)
+		return -EINVAL;
+#endif
+#ifdef CONFIG_SECURITY
+	if (cred->security)
+		return -EINVAL;
+#endif
+	return 0;
+}
+
+static int __cr_collect_cred(struct cr_context *ctx, struct cred *cred)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_cred) {
+		if (obj->o_obj == cred) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(cred);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_cred);
+	printk("%s: collect cred %p\n", __func__, cred);
+	return 0;
+}
+
+int cr_collect_cred(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, cr_task_struct) {
+		struct task_struct *tsk = obj->o_obj;
+
+		rv = cr_check_cred((struct cred *)tsk->real_cred);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_cred(ctx, (struct cred *)tsk->real_cred);
+		if (rv < 0)
+			return rv;
+		rv = cr_check_cred((struct cred *)tsk->cred);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_cred(ctx, (struct cred *)tsk->cred);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, cr_file) {
+		struct file *file = obj->o_obj;
+
+		rv = cr_check_cred((struct cred *)file->f_cred);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_cred(ctx, (struct cred *)file->f_cred);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, cr_cred) {
+		struct cred *cred = obj->o_obj;
+		unsigned int cnt = atomic_read(&cred->usage);
+
+		if (obj->o_count != cnt) {
+			printk("%s: cred %p has external references %u:%u\n", __func__, cred, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
diff --git a/kernel/cr/cpt-fs.c b/kernel/cr/cpt-fs.c
new file mode 100644
index 0000000..3fd6d0d
--- /dev/null
+++ b/kernel/cr/cpt-fs.c
@@ -0,0 +1,122 @@
+#include <linux/fdtable.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/sched.h>
+#include "cr.h"
+
+static int cr_check_file(struct file *file)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+	case S_IFDIR:
+		/* Likely on-disk filesystem. */
+		/* FIXME: FUSE, NFS, other networking filesystems */
+		if (inode->i_sb->s_type->fs_flags & FS_REQUIRES_DEV)
+			return 0;
+		break;
+	case S_IFBLK:
+		break;
+	case S_IFCHR:
+		break;
+	case S_IFIFO:
+		break;
+	case S_IFSOCK:
+		break;
+	case S_IFLNK:
+		/* One can't open symlink. */
+		BUG();
+	}
+	printk("%s: can't checkpoint file %p, ->f_op = %pS\n", __func__, file, file->f_op);
+	return -EINVAL;
+}
+
+int __cr_collect_file(struct cr_context *ctx, struct file *file)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_file) {
+		if (obj->o_obj == file) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(file);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_file);
+	printk("%s: collect file %p\n", __func__, file);
+	return 0;
+}
+
+static int __cr_collect_files_struct(struct cr_context *ctx, struct files_struct *fs)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_files_struct) {
+		if (obj->o_obj == fs) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(fs);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_files_struct);
+	printk("%s: collect files_struct %p\n", __func__, fs);
+	return 0;
+}
+
+int cr_collect_files_struct(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, cr_task_struct) {
+		struct task_struct *tsk = obj->o_obj;
+
+		rv = __cr_collect_files_struct(ctx, tsk->files);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, cr_files_struct) {
+		struct files_struct *fs = obj->o_obj;
+		unsigned int cnt = atomic_read(&fs->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: files_struct %p has external references %u:%u\n", __func__, fs, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	for_each_cr_object(ctx, obj, cr_files_struct) {
+		struct files_struct *fs = obj->o_obj;
+		int fd;
+
+		for (fd = 0; fd < files_fdtable(fs)->max_fds; fd++) {
+			struct file *file;
+
+			file = fcheck_files(fs, fd);
+			if (file) {
+				rv = cr_check_file(file);
+				if (rv < 0)
+					return rv;
+				rv = __cr_collect_file(ctx, file);
+				if (rv < 0)
+					return rv;
+			}
+		}
+	}
+	for_each_cr_object(ctx, obj, cr_file) {
+		struct file *file = obj->o_obj;
+		unsigned long cnt = atomic_long_read(&file->f_count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: file %p/%pS has external references %u:%lu\n", __func__, file, file->f_op, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
diff --git a/kernel/cr/cpt-mm.c b/kernel/cr/cpt-mm.c
new file mode 100644
index 0000000..e7e1ff0
--- /dev/null
+++ b/kernel/cr/cpt-mm.c
@@ -0,0 +1,134 @@
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/sched.h>
+#include "cr.h"
+
+static int cr_check_vma(struct vm_area_struct *vma)
+{
+	unsigned long flags = vma->vm_flags;
+
+	printk("%s: vma = %p, ->vm_flags = 0x%lx\n", __func__, vma, flags);
+	/* Flags, we know and love. */
+	flags &= ~VM_READ;
+	flags &= ~VM_WRITE;
+	flags &= ~VM_EXEC;
+	flags &= ~VM_MAYREAD;
+	flags &= ~VM_MAYWRITE;
+	flags &= ~VM_MAYEXEC;
+	flags &= ~VM_GROWSDOWN;
+	flags &= ~VM_DENYWRITE;
+	flags &= ~VM_EXECUTABLE;
+	flags &= ~VM_DONTEXPAND;
+	flags &= ~VM_ACCOUNT;
+	flags &= ~VM_ALWAYSDUMP;
+	flags &= ~VM_CAN_NONLINEAR;
+	/* Flags, we don't know and don't love. */
+	if (flags) {
+		printk("%s: vma = %p, unknown ->vm_flags 0x%lx\n", __func__, vma, flags);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int cr_check_mm(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (!mm)
+		return -EINVAL;
+	down_read(&mm->mmap_sem);
+	if (mm->core_state) {
+		up_read(&mm->mmap_sem);
+		return -EINVAL;
+	}
+	up_read(&mm->mmap_sem);
+#ifdef CONFIG_AIO
+	spin_lock(&mm->ioctx_lock);
+	if (!hlist_empty(&mm->ioctx_list)) {
+		spin_unlock(&mm->ioctx_lock);
+		return -EINVAL;
+	}
+	spin_unlock(&mm->ioctx_lock);
+#endif
+#ifdef CONFIG_MM_OWNER
+	if (mm->owner != tsk)
+		return -EINVAL;
+#endif
+#ifdef CONFIG_MMU_NOTIFIER
+	down_read(&mm->mmap_sem);
+	if (mm_has_notifiers(mm)) {
+		up_read(&mm->mmap_sem);
+		return -EINVAL;
+	}
+	up_read(&mm->mmap_sem);
+#endif
+	return 0;
+}
+
+static int __cr_collect_mm(struct cr_context *ctx, struct mm_struct *mm)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_mm_struct) {
+		if (obj->o_obj == mm) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(mm);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_mm_struct);
+	printk("%s: collect mm_struct %p\n", __func__, mm);
+	return 0;
+}
+
+int cr_collect_mm(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, cr_task_struct) {
+		struct task_struct *tsk = obj->o_obj;
+		struct mm_struct *mm = tsk->mm;
+
+		rv = cr_check_mm(mm, tsk);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_mm(ctx, mm);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, cr_mm_struct) {
+		struct mm_struct *mm = obj->o_obj;
+		unsigned int cnt = atomic_read(&mm->mm_users);
+
+		if (obj->o_count != cnt) {
+			printk("%s: mm_struct %p has external references %u:%u\n", __func__, mm, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	for_each_cr_object(ctx, obj, cr_mm_struct) {
+		struct mm_struct *mm = obj->o_obj;
+		struct vm_area_struct *vma;
+
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			rv = cr_check_vma(vma);
+			if (rv < 0)
+				return rv;
+			if (vma->vm_file) {
+				rv = __cr_collect_file(ctx, vma->vm_file);
+				if (rv < 0)
+					return rv;
+			}
+		}
+#ifdef CONFIG_PROC_FS
+		if (mm->exe_file) {
+			rv = __cr_collect_file(ctx, mm->exe_file);
+			if (rv < 0)
+				return rv;
+		}
+#endif
+	}
+	return 0;
+}
diff --git a/kernel/cr/cpt-ns.c b/kernel/cr/cpt-ns.c
new file mode 100644
index 0000000..0cbf964
--- /dev/null
+++ b/kernel/cr/cpt-ns.c
@@ -0,0 +1,324 @@
+#include <linux/cr.h>
+#include <linux/ipc_namespace.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/mnt_namespace.h>
+#include <linux/pid_namespace.h>
+#include <linux/utsname.h>
+#include <net/net_namespace.h>
+#include "cr.h"
+
+int cr_dump_uts_ns(struct cr_context *ctx, struct uts_namespace *uts_ns)
+{
+	struct cr_image_uts_ns *i;
+
+	printk("%s: dump uts_ns %p\n", __func__, uts_ns);
+
+	i = kzalloc(sizeof(*i), GFP_KERNEL);
+	if (!i)
+		return -ENOMEM;
+	i->cr_type = CR_OBJ_UTS_NS;
+	i->cr_len = sizeof(*i);
+
+	strncpy((char *)i->cr_sysname, (const char *)uts_ns->name.sysname, 64);
+	strncpy((char *)i->cr_nodename, (const char *)uts_ns->name.nodename, 64);
+	strncpy((char *)i->cr_release, (const char *)uts_ns->name.release, 64);
+	strncpy((char *)i->cr_version, (const char *)uts_ns->name.version, 64);
+	strncpy((char *)i->cr_machine, (const char *)uts_ns->name.machine, 64);
+	strncpy((char *)i->cr_domainname, (const char *)uts_ns->name.domainname, 64);
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+
+	kfree(i);
+	return 0;
+}
+
+static int __cr_collect_uts_ns(struct cr_context *ctx, struct uts_namespace *uts_ns)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_uts_ns) {
+		if (obj->o_obj == uts_ns) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(uts_ns);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_uts_ns);
+	printk("%s: collect uts_ns %p\n", __func__, uts_ns);
+	return 0;
+}
+
+static int cr_collect_uts_ns(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, cr_nsproxy) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = __cr_collect_uts_ns(ctx, nsproxy->uts_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, cr_uts_ns) {
+		struct uts_namespace *uts_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&uts_ns->kref.refcount);
+
+		if (obj->o_count != cnt) {
+			printk("%s: uts_ns %p has external references %u:%u\n", __func__, uts_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+#ifdef CONFIG_SYSVIPC
+static int __cr_collect_ipc_ns(struct cr_context *ctx, struct ipc_namespace *ipc_ns)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_ipc_ns) {
+		if (obj->o_obj == ipc_ns) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(ipc_ns);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_ipc_ns);
+	printk("%s: collect ipc_ns %p\n", __func__, ipc_ns);
+	return 0;
+}
+
+static int cr_collect_ipc_ns(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, cr_nsproxy) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = __cr_collect_ipc_ns(ctx, nsproxy->ipc_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, cr_ipc_ns) {
+		struct ipc_namespace *ipc_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&ipc_ns->kref.refcount);
+
+		if (obj->o_count != cnt) {
+			printk("%s: ipc_ns %p has external references %u:%u\n", __func__, ipc_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+#else
+static int cr_collect_ipc_ns(struct cr_context *ctx)
+{
+	return 0;
+}
+#endif
+
+static int __cr_collect_mnt_ns(struct cr_context *ctx, struct mnt_namespace *mnt_ns)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_mnt_ns) {
+		if (obj->o_obj == mnt_ns) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(mnt_ns);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_mnt_ns);
+	printk("%s: collect mnt_ns %p\n", __func__, mnt_ns);
+	return 0;
+}
+
+static int cr_collect_mnt_ns(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, cr_nsproxy) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = __cr_collect_mnt_ns(ctx, nsproxy->mnt_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, cr_mnt_ns) {
+		struct mnt_namespace *mnt_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&mnt_ns->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: mnt_ns %p has external references %u:%u\n", __func__, mnt_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int __cr_collect_pid_ns(struct cr_context *ctx, struct pid_namespace *pid_ns)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_pid_ns) {
+		if (obj->o_obj == pid_ns) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(pid_ns);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_pid_ns);
+	printk("%s: collect pid_ns %p\n", __func__, pid_ns);
+	return 0;
+}
+
+static int cr_collect_pid_ns(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, cr_nsproxy) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = __cr_collect_pid_ns(ctx, nsproxy->pid_ns);
+		if (rv < 0)
+			return rv;
+	}
+	/*
+	 * FIXME: check for external pid_ns references
+	 * 1. struct pid pins pid_ns
+	 * 2. struct pid_namespace pins pid_ns, but only parent one
+	 */
+	return 0;
+}
+
+#ifdef CONFIG_NET
+static int __cr_collect_net_ns(struct cr_context *ctx, struct net *net_ns)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_net_ns) {
+		if (obj->o_obj == net_ns) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(net_ns);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_net_ns);
+	printk("%s: collect net_ns %p\n", __func__, net_ns);
+	return 0;
+}
+
+static int cr_collect_net_ns(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, cr_nsproxy) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = __cr_collect_net_ns(ctx, nsproxy->net_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, cr_net_ns) {
+		struct net *net_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&net_ns->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: net_ns %p has external references %u:%u\n", __func__, net_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+#else
+static int cr_collect_net_ns(struct cr_context *ctx)
+{
+	return 0;
+}
+#endif
+
+static int __cr_collect_nsproxy(struct cr_context *ctx, struct nsproxy *nsproxy)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_nsproxy) {
+		if (obj->o_obj == nsproxy) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(nsproxy);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_nsproxy);
+	printk("%s: collect nsproxy %p\n", __func__, nsproxy);
+	return 0;
+}
+
+int cr_collect_nsproxy(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, cr_task_struct) {
+		struct task_struct *tsk = obj->o_obj;
+		struct nsproxy *nsproxy;
+
+		rcu_read_lock();
+		nsproxy = task_nsproxy(tsk);
+		rcu_read_unlock();
+		if (!nsproxy)
+			return -EAGAIN;
+
+		rv = __cr_collect_nsproxy(ctx, nsproxy);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, cr_nsproxy) {
+		struct nsproxy *nsproxy = obj->o_obj;
+		unsigned int cnt = atomic_read(&nsproxy->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: nsproxy %p has external references %u:%u\n", __func__, nsproxy, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	rv = cr_collect_uts_ns(ctx);
+	if (rv < 0)
+		return rv;
+	rv = cr_collect_ipc_ns(ctx);
+	if (rv < 0)
+		return rv;
+	rv = cr_collect_mnt_ns(ctx);
+	if (rv < 0)
+		return rv;
+	rv = cr_collect_pid_ns(ctx);
+	if (rv < 0)
+		return rv;
+	rv = cr_collect_net_ns(ctx);
+	if (rv < 0)
+		return rv;
+	return 0;
+}
diff --git a/kernel/cr/cpt-signal.c b/kernel/cr/cpt-signal.c
new file mode 100644
index 0000000..cb074f5
--- /dev/null
+++ b/kernel/cr/cpt-signal.c
@@ -0,0 +1,121 @@
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include "cr.h"
+
+static int cr_check_signal(struct signal_struct *signal)
+{
+	if (!signal)
+		return -EINVAL;
+	if (!list_empty(&signal->posix_timers))
+		return -EINVAL;
+#ifdef CONFIG_KEYS
+	if (signal->session_keyring || signal->process_keyring)
+		return -EINVAL;
+#endif
+	return 0;
+}
+
+static int __cr_collect_signal(struct cr_context *ctx, struct signal_struct *signal)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_signal) {
+		if (obj->o_obj == signal) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(signal);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_signal);
+	printk("%s: collect signal_struct %p\n", __func__, signal);
+	return 0;
+}
+
+int cr_collect_signal(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, cr_task_struct) {
+		struct task_struct *tsk = obj->o_obj;
+		struct signal_struct *signal = tsk->signal;
+
+		rv = cr_check_signal(signal);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_signal(ctx, signal);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, cr_signal) {
+		struct signal_struct *signal = obj->o_obj;
+		unsigned int cnt = atomic_read(&signal->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: signal_struct %p has external references %u:%u\n", __func__, signal, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+static int cr_check_sighand(struct sighand_struct *sighand)
+{
+	if (!sighand)
+		return -EINVAL;
+#ifdef CONFIG_SIGNALFD
+	if (waitqueue_active(&sighand->signalfd_wqh))
+		return -EINVAL;
+#endif
+	return 0;
+}
+
+static int __cr_collect_sighand(struct cr_context *ctx, struct sighand_struct *sighand)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_sighand) {
+		if (obj->o_obj == sighand) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(sighand);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_sighand);
+	printk("%s: collect sighand_struct %p\n", __func__, sighand);
+	return 0;
+}
+
+int cr_collect_sighand(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, cr_task_struct) {
+		struct task_struct *tsk = obj->o_obj;
+		struct sighand_struct *sighand = tsk->sighand;
+
+		rv = cr_check_sighand(sighand);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_sighand(ctx, sighand);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, cr_sighand) {
+		struct sighand_struct *sighand = obj->o_obj;
+		unsigned int cnt = atomic_read(&sighand->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: sighand_struct %p has external references %u:%u\n", __func__, sighand, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
diff --git a/kernel/cr/cpt-sys.c b/kernel/cr/cpt-sys.c
new file mode 100644
index 0000000..27d3678
--- /dev/null
+++ b/kernel/cr/cpt-sys.c
@@ -0,0 +1,228 @@
+#include <linux/capability.h>
+#include <linux/cr.h>
+#include <linux/file.h>
+#include <linux/freezer.h>
+#include <linux/fs.h>
+#include <linux/nsproxy.h>
+#include <linux/pid_namespace.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/utsname.h>
+#include "cr.h"
+
+/* 'tsk' is child of 'parent' in some generation. */
+static int child_of(struct task_struct *parent, struct task_struct *tsk)
+{
+	struct task_struct *tmp = tsk;
+
+	while (tmp != &init_task) {
+		if (tmp == parent)
+			return 1;
+		tmp = tmp->real_parent;
+	}
+	/* In case 'parent' is 'init_task'. */
+	return tmp == parent;
+}
+
+static int cr_freeze_tasks(struct task_struct *init_tsk)
+{
+	struct task_struct *tmp, *tsk;
+
+	read_lock(&tasklist_lock);
+	do_each_thread(tmp, tsk) {
+		if (child_of(init_tsk, tsk)) {
+			if (!freeze_task(tsk, 1)) {
+				printk("%s: freezing '%s' failed\n", __func__, tsk->comm);
+				read_unlock(&tasklist_lock);
+				return -EBUSY;
+			}
+		}
+	} while_each_thread(tmp, tsk);
+	read_unlock(&tasklist_lock);
+	return 0;
+}
+
+static void cr_thaw_tasks(struct task_struct *init_tsk)
+{
+	struct task_struct *tmp, *tsk;
+
+	read_lock(&tasklist_lock);
+	do_each_thread(tmp, tsk) {
+		if (child_of(init_tsk, tsk))
+			thaw_process(tsk);
+	} while_each_thread(tmp, tsk);
+	read_unlock(&tasklist_lock);
+}
+
+static int __cr_collect_task(struct cr_context *ctx, struct task_struct *tsk)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, cr_task_struct) {
+		BUG_ON(obj->o_obj == tsk);
+	}
+
+	obj = cr_object_create(tsk);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_task_struct);
+	get_task_struct(tsk);
+	printk("%s: collect task %p/%s\n", __func__, tsk, tsk->comm);
+	return 0;
+}
+
+static int cr_collect_tasks(struct cr_context *ctx, struct task_struct *init_tsk)
+{
+	struct cr_object *obj;
+	int rv;
+
+	rv = __cr_collect_task(ctx, init_tsk);
+	if (rv < 0)
+		return rv;
+
+	for_each_cr_object(ctx, obj, cr_task_struct) {
+		struct task_struct *tsk = obj->o_obj, *child;
+
+		/* Collect threads. */
+		if (thread_group_leader(tsk)) {
+			struct task_struct *thread = tsk;
+
+			while ((thread = next_thread(thread)) != tsk) {
+				rv = __cr_collect_task(ctx, thread);
+				if (rv < 0)
+					return rv;
+			}
+		}
+
+		/* Collect children. */
+		list_for_each_entry(child, &tsk->children, sibling) {
+			rv = __cr_collect_task(ctx, child);
+			if (rv < 0)
+				return rv;
+		}
+	}
+	return 0;
+}
+
+static void cr_dump_header(struct cr_context *ctx)
+{
+	struct cr_header hdr;
+
+	memset(&hdr, 0, sizeof(struct cr_header));
+	hdr.cr_signature[0] = 'L';
+	hdr.cr_signature[1] = 'i';
+	hdr.cr_signature[2] = 'n';
+	hdr.cr_signature[3] = 'u';
+	hdr.cr_signature[4] = 'x';
+	hdr.cr_signature[5] = 'C';
+	hdr.cr_signature[6] = '/';
+	hdr.cr_signature[7] = 'R';
+	hdr.cr_image_version = cpu_to_le64(CR_IMAGE_VERSION);
+	strncpy((char *)&hdr.cr_uts_release, (const char *)init_uts_ns.name.release, 64);
+#ifdef CONFIG_X86_32
+	hdr.cr_arch = cpu_to_le32(CR_ARCH_X86_32);
+#endif
+	cr_write(ctx, &hdr, sizeof(struct cr_header));
+	cr_align(ctx);
+}
+
+static int cr_dump(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	cr_dump_header(ctx);
+
+	for_each_cr_object(ctx, obj, cr_uts_ns) {
+		rv = cr_dump_uts_ns(ctx, obj->o_obj);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, cr_cred) {
+		rv = cr_dump_cred(ctx, obj->o_obj);
+		if (rv < 0)
+			return rv;
+	}
+	return 0;
+}
+
+SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
+{
+	struct cr_context *ctx;
+	struct file *file;
+	struct task_struct *init_tsk = NULL, *tsk;
+	int rv = 0;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	file = fget(fd);
+	if (!file)
+		return -EBADF;
+	if (!(file->f_mode & FMODE_WRITE) || !file->f_op || !file->f_op->write) {
+		rv = (file->f_mode & FMODE_WRITE) ? -EINVAL : -EBADF;
+		goto out_no_init_tsk;
+	}
+
+	rcu_read_lock();
+	tsk = find_task_by_vpid(pid);
+	if (tsk) {
+		init_tsk = task_nsproxy(tsk)->pid_ns->child_reaper;
+		get_task_struct(init_tsk);
+	}
+	rcu_read_unlock();
+	if (!init_tsk) {
+		rv = -ESRCH;
+		goto out_no_init_tsk;
+	}
+
+	ctx = cr_context_create(init_tsk, file);
+	if (!ctx) {
+		rv = -ENOMEM;
+		goto out_ctx_alloc;
+	}
+
+	rv = cr_freeze_tasks(init_tsk);
+	if (rv < 0)
+		goto out_freeze;
+	rv = cr_collect_tasks(ctx, init_tsk);
+	if (rv < 0)
+		goto out_collect_tasks;
+	rv = cr_collect_nsproxy(ctx);
+	if (rv < 0)
+		goto out_collect_nsproxy;
+	rv = cr_collect_mm(ctx);
+	if (rv < 0)
+		goto out_collect_mm;
+	rv = cr_collect_files_struct(ctx);
+	if (rv < 0)
+		goto out_collect_files_struct;
+	/* After tasks and after files. */
+	rv = cr_collect_cred(ctx);
+	if (rv < 0)
+		goto out_collect_cred;
+	rv = cr_collect_signal(ctx);
+	if (rv < 0)
+		goto out_collect_signal;
+	rv = cr_collect_sighand(ctx);
+	if (rv < 0)
+		goto out_collect_sighand;
+
+	rv = cr_dump(ctx);
+
+out_collect_sighand:
+out_collect_signal:
+out_collect_cred:
+out_collect_files_struct:
+out_collect_mm:
+out_collect_nsproxy:
+out_collect_tasks:
+	cr_thaw_tasks(init_tsk);
+out_freeze:
+	cr_context_destroy(ctx);
+out_ctx_alloc:
+	put_task_struct(init_tsk);
+out_no_init_tsk:
+	fput(file);
+	return rv;
+}
diff --git a/kernel/cr/cr-ctx.c b/kernel/cr/cr-ctx.c
new file mode 100644
index 0000000..b203c89
--- /dev/null
+++ b/kernel/cr/cr-ctx.c
@@ -0,0 +1,141 @@
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/nsproxy.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <asm/processor.h>
+#include <asm/uaccess.h>
+#include "cr.h"
+
+void cr_write(struct cr_context *ctx, const void *buf, size_t count)
+{
+	struct file *file = ctx->cr_dump_file;
+	mm_segment_t old_fs;
+	ssize_t rv;
+
+	if (ctx->cr_write_error)
+		return;
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+	rv = file->f_op->write(file, (const char __user *)buf, count, &file->f_pos);
+	set_fs(old_fs);
+	if (rv != count)
+		ctx->cr_write_error = (rv < 0) ? rv : -EIO;
+}
+
+void cr_align(struct cr_context *ctx)
+{
+	struct file *file = ctx->cr_dump_file;
+
+	file->f_pos = ALIGN(file->f_pos, 8);
+}
+
+struct cr_object *cr_object_create(void *data)
+{
+	struct cr_object *obj;
+
+	obj = kmalloc(sizeof(struct cr_object), GFP_KERNEL);
+	if (obj) {
+		obj->o_count = 1;
+		obj->o_obj = data;
+	}
+	return obj;
+}
+
+struct cr_context *cr_context_create(struct task_struct *tsk, struct file *file)
+{
+	struct cr_context *ctx;
+
+	ctx = kmalloc(sizeof(struct cr_context), GFP_KERNEL);
+	if (ctx) {
+		ctx->cr_init_tsk = tsk;
+		ctx->cr_dump_file = file;
+		ctx->cr_write_error = 0;
+
+		INIT_LIST_HEAD(&ctx->cr_task_struct);
+		INIT_LIST_HEAD(&ctx->cr_nsproxy);
+		INIT_LIST_HEAD(&ctx->cr_uts_ns);
+#ifdef CONFIG_SYSVIPC
+		INIT_LIST_HEAD(&ctx->cr_ipc_ns);
+#endif
+		INIT_LIST_HEAD(&ctx->cr_mnt_ns);
+		INIT_LIST_HEAD(&ctx->cr_pid_ns);
+#ifdef CONFIG_NET
+		INIT_LIST_HEAD(&ctx->cr_net_ns);
+#endif
+		INIT_LIST_HEAD(&ctx->cr_mm_struct);
+		INIT_LIST_HEAD(&ctx->cr_files_struct);
+		INIT_LIST_HEAD(&ctx->cr_file);
+		INIT_LIST_HEAD(&ctx->cr_cred);
+		INIT_LIST_HEAD(&ctx->cr_signal);
+		INIT_LIST_HEAD(&ctx->cr_sighand);
+	}
+	return ctx;
+}
+
+void cr_context_destroy(struct cr_context *ctx)
+{
+	struct cr_object *obj, *tmp;
+
+	for_each_cr_object_safe(ctx, obj, tmp, cr_sighand) {
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+	for_each_cr_object_safe(ctx, obj, tmp, cr_signal) {
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+	for_each_cr_object_safe(ctx, obj, tmp, cr_cred) {
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+	for_each_cr_object_safe(ctx, obj, tmp, cr_file) {
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+	for_each_cr_object_safe(ctx, obj, tmp, cr_files_struct) {
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+	for_each_cr_object_safe(ctx, obj, tmp, cr_mm_struct) {
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+#ifdef CONFIG_NET
+	for_each_cr_object_safe(ctx, obj, tmp, cr_net_ns) {
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+#endif
+	for_each_cr_object_safe(ctx, obj, tmp, cr_pid_ns) {
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+	for_each_cr_object_safe(ctx, obj, tmp, cr_mnt_ns) {
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+#ifdef CONFIG_SYSVIPC
+	for_each_cr_object_safe(ctx, obj, tmp, cr_ipc_ns) {
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+#endif
+	for_each_cr_object_safe(ctx, obj, tmp, cr_uts_ns) {
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+	for_each_cr_object_safe(ctx, obj, tmp, cr_nsproxy) {
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+	for_each_cr_object_safe(ctx, obj, tmp, cr_task_struct) {
+		struct task_struct *tsk = obj->o_obj;
+
+		put_task_struct(tsk);
+		list_del(&obj->o_list);
+		cr_object_destroy(obj);
+	}
+	kfree(ctx);
+}
diff --git a/kernel/cr/cr.h b/kernel/cr/cr.h
new file mode 100644
index 0000000..73a9fd9
--- /dev/null
+++ b/kernel/cr/cr.h
@@ -0,0 +1,61 @@
+#ifndef __CR_H
+#define __CR_H
+
+struct cr_object {
+	struct list_head	o_list;	/* entry in ->cr_* lists */
+	void			*o_obj;	/* pointer to object being collected/dumped */
+	unsigned int		o_count;/* number of references from collected objects */
+};
+
+struct cr_context {
+	struct task_struct	*cr_init_tsk;
+	struct file		*cr_dump_file;
+	int			cr_write_error;
+
+	struct list_head	cr_task_struct;
+	struct list_head	cr_nsproxy;
+	struct list_head	cr_uts_ns;
+#ifdef CONFIG_SYSVIPC
+	struct list_head	cr_ipc_ns;
+#endif
+	struct list_head	cr_mnt_ns;
+	struct list_head	cr_pid_ns;
+#ifdef CONFIG_NET
+	struct list_head	cr_net_ns;
+#endif
+	struct list_head	cr_mm_struct;
+	struct list_head	cr_files_struct;
+	struct list_head	cr_file;
+	struct list_head	cr_cred;
+	struct list_head	cr_signal;
+	struct list_head	cr_sighand;
+};
+
+#define for_each_cr_object(ctx, obj, lh)		\
+	list_for_each_entry(obj, &ctx->lh, o_list)
+#define for_each_cr_object_safe(ctx, obj, tmp, lh)	\
+	list_for_each_entry_safe(obj, tmp, &ctx->lh, o_list)
+
+struct cr_object *cr_object_create(void *data);
+static inline void cr_object_destroy(struct cr_object *obj)
+{
+	kfree(obj);
+}
+
+struct cr_context *cr_context_create(struct task_struct *tsk, struct file *file);
+void cr_context_destroy(struct cr_context *ctx);
+
+int cr_collect_nsproxy(struct cr_context *ctx);
+int cr_collect_cred(struct cr_context *ctx);
+int cr_collect_signal(struct cr_context *ctx);
+int cr_collect_sighand(struct cr_context *ctx);
+int cr_collect_mm(struct cr_context *ctx);
+int __cr_collect_file(struct cr_context *ctx, struct file *file);
+int cr_collect_files_struct(struct cr_context *ctx);
+
+void cr_write(struct cr_context *ctx, const void *buf, size_t count);
+void cr_align(struct cr_context *ctx);
+
+int cr_dump_uts_ns(struct cr_context *ctx, struct uts_namespace *uts_ns);
+int cr_dump_cred(struct cr_context *ctx, struct cred *cred);
+#endif
diff --git a/kernel/cr/rst-sys.c b/kernel/cr/rst-sys.c
new file mode 100644
index 0000000..35c3d15
--- /dev/null
+++ b/kernel/cr/rst-sys.c
@@ -0,0 +1,9 @@
+#include <linux/capability.h>
+#include <linux/syscalls.h>
+
+SYSCALL_DEFINE2(restart, int, fd, unsigned long, flags)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	return -ENOSYS;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 27dad29..da4fbf6 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -175,3 +175,6 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
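
The collect functions in the patch above all follow one pattern: walk the collected tasks, count how many references to each shared object come from inside the set, then compare that count with the object's own reference count. A mismatch means something outside the container holds a reference, and the checkpoint is refused. What follows is only a minimal userspace sketch of that idea, not the patch's code: `struct shared`, `struct task`, `struct collected`, `MAX_OBJS` and `collect_and_verify()` are hypothetical names, and a fixed-size table stands in for the `cr_object` lists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted kernel object (nsproxy, signal_struct, ...). */
struct shared {
	unsigned int refcount;	/* total references, from any holder */
};

/* Hypothetical stand-in for a task holding one reference to a shared object. */
struct task {
	struct shared *res;
};

/* One entry per distinct object seen, playing the role of struct cr_object. */
struct collected {
	struct shared *obj;
	unsigned int seen;	/* references found among the collected tasks */
};

#define MAX_OBJS 16

/*
 * Returns the number of distinct objects collected, or -1 if the table is
 * full or some object is also referenced from outside the task set.
 */
static int collect_and_verify(const struct task *tasks, size_t ntasks,
			      struct collected *tbl)
{
	size_t nobjs = 0, i, j;

	for (i = 0; i < ntasks; i++) {
		/* Already collected? Then just count one more reference. */
		for (j = 0; j < nobjs; j++) {
			if (tbl[j].obj == tasks[i].res) {
				tbl[j].seen++;
				break;
			}
		}
		if (j < nobjs)
			continue;
		if (nobjs == MAX_OBJS)
			return -1;
		tbl[nobjs].obj = tasks[i].res;
		tbl[nobjs].seen = 1;
		nobjs++;
	}

	for (j = 0; j < nobjs; j++) {
		/* We can never have seen more references than exist. */
		assert(tbl[j].seen <= tbl[j].obj->refcount);
		if (tbl[j].seen != tbl[j].obj->refcount)
			return -1;	/* a holder outside the set exists */
	}
	return (int)nobjs;
}
```

With three tasks where two share one object and the third has its own, the call returns 2 as long as no other holder exists. Raising either refcount without adding a matching task makes it return -1.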
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-26 22:31                       ` Alexey Dobriyan
@ 2009-02-27  9:03                         ` Ingo Molnar
  2009-02-27  9:19                           ` Andrew Morton
       [not found]                           ` <20090227090323.GC16211-X9Un+BFzKDI@public.gmane.org>
  2009-02-27 16:14                         ` Dave Hansen
       [not found]                         ` <20090226223112.GA2939-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
  2 siblings, 2 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-02-27  9:03 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Andrew Morton, Dave Hansen, mpm, containers, hpa, linux-kernel,
	linux-mm, viro, linux-api, torvalds, tglx, xemul
* Alexey Dobriyan <adobriyan@gmail.com> wrote:
> > I think the main question is: will we ever find ourselves in 
> > the future saying that "C/R sucks, nobody but a small 
> > minority uses it, wish we had never merged it"? I think the 
> > likelihood of that is very low. I think the current OpenVZ 
> > stuff already looks very useful, and I don't think we've 
> > realized (let alone explored) all the possibilities yet.
> 
> This is collecting and start of dumping part of cleaned up 
> OpenVZ C/R implementation, FYI.
> 
>  arch/x86/include/asm/unistd_32.h   |    2 
>  arch/x86/kernel/syscall_table_32.S |    2 
>  include/linux/Kbuild               |    1 
>  include/linux/cr.h                 |   56 ++++++
>  include/linux/ipc_namespace.h      |    3 
>  include/linux/syscalls.h           |    5 
>  init/Kconfig                       |    2 
>  kernel/Makefile                    |    1 
>  kernel/cr/Kconfig                  |   11 +
>  kernel/cr/Makefile                 |    8 
>  kernel/cr/cpt-cred.c               |  115 +++++++++++++
>  kernel/cr/cpt-fs.c                 |  122 +++++++++++++
>  kernel/cr/cpt-mm.c                 |  134 +++++++++++++++
>  kernel/cr/cpt-ns.c                 |  324 +++++++++++++++++++++++++++++++++++++
>  kernel/cr/cpt-signal.c             |  121 +++++++++++++
>  kernel/cr/cpt-sys.c                |  228 ++++++++++++++++++++++++++
>  kernel/cr/cr-ctx.c                 |  141 ++++++++++++++++
>  kernel/cr/cr.h                     |   61 ++++++
>  kernel/cr/rst-sys.c                |    9 +
>  kernel/sys_ni.c                    |    3 
>  20 files changed, 1349 insertions(+)
That does not look scary to me at all. Andrew?
Before going into any fine details, a small high-level structure 
nit: the namespace is fine in kernel/cr/ too, I guess, but 
wouldn't it be even better to move the files close to their respective 
subsystems? mm/checkpoint.c, etc.?
Just like we have mm/nommu.c, fs/proc/nommu.c, etc. - not 
kernel/nommu/mm.c, kernel/nommu/proc.c.
I realize that for your forward-porting efforts it was a good 
idea to keep it all separated, but once we move this upstream 
the organization should be in close proximity of the code it 
affects.
That will have another advantage as well: the folks maintaining 
those subsystems will be more aware of it.
	Ingo
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-27  9:03                         ` Ingo Molnar
@ 2009-02-27  9:19                           ` Andrew Morton
  2009-02-27 10:57                             ` Alexey Dobriyan
       [not found]                           ` <20090227090323.GC16211-X9Un+BFzKDI@public.gmane.org>
  1 sibling, 1 reply; 121+ messages in thread
From: Andrew Morton @ 2009-02-27  9:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexey Dobriyan, Dave Hansen, mpm, containers, hpa, linux-kernel,
	linux-mm, viro, linux-api, torvalds, tglx, xemul
On Fri, 27 Feb 2009 10:03:23 +0100 Ingo Molnar <mingo@elte.hu> wrote:
> 
> * Alexey Dobriyan <adobriyan@gmail.com> wrote:
> 
> > > I think the main question is: will we ever find ourselves in 
> > > the future saying that "C/R sucks, nobody but a small 
> > > minority uses it, wish we had never merged it"? I think the 
> > > > likelihood of that is very low. I think the current OpenVZ 
> > > > stuff already looks very useful, and I don't think we've 
> > > realized (let alone explored) all the possibilities yet.
> > 
> > This is collecting and start of dumping part of cleaned up 
> > OpenVZ C/R implementation, FYI.
> > 
> >  arch/x86/include/asm/unistd_32.h   |    2 
> >  arch/x86/kernel/syscall_table_32.S |    2 
> >  include/linux/Kbuild               |    1 
> >  include/linux/cr.h                 |   56 ++++++
> >  include/linux/ipc_namespace.h      |    3 
> >  include/linux/syscalls.h           |    5 
> >  init/Kconfig                       |    2 
> >  kernel/Makefile                    |    1 
> >  kernel/cr/Kconfig                  |   11 +
> >  kernel/cr/Makefile                 |    8 
> >  kernel/cr/cpt-cred.c               |  115 +++++++++++++
> >  kernel/cr/cpt-fs.c                 |  122 +++++++++++++
> >  kernel/cr/cpt-mm.c                 |  134 +++++++++++++++
> >  kernel/cr/cpt-ns.c                 |  324 +++++++++++++++++++++++++++++++++++++
> >  kernel/cr/cpt-signal.c             |  121 +++++++++++++
> >  kernel/cr/cpt-sys.c                |  228 ++++++++++++++++++++++++++
> >  kernel/cr/cr-ctx.c                 |  141 ++++++++++++++++
> >  kernel/cr/cr.h                     |   61 ++++++
> >  kernel/cr/rst-sys.c                |    9 +
> >  kernel/sys_ni.c                    |    3 
> >  20 files changed, 1349 insertions(+)
> 
> That does not look scary to me at all. Andrew?
I think we'd need to look into the details.  Sure, it's isolated from a
where-it-is-in-the-tree POV.  But I assume that each of those files has
intimate and intrusive knowledge of the internals of data structures?
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                           ` <20090226221709.GA2924-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
@ 2009-02-27  9:19                             ` Greg Kurz
  2009-02-27 10:53                               ` Alexey Dobriyan
  0 siblings, 1 reply; 121+ messages in thread
From: Greg Kurz @ 2009-02-27  9:19 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Ingo Molnar, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
On Fri, 2009-02-27 at 01:17 +0300, Alexey Dobriyan wrote:
> On Thu, Feb 26, 2009 at 07:30:16PM +0100, Greg Kurz wrote:
> > On Thu, 2009-02-26 at 18:33 +0100, Ingo Molnar wrote:
> > > I think the main question is: will we ever find ourselves in the 
> > > future saying that "C/R sucks, nobody but a small minority uses 
> > > it, wish we had never merged it"? I think the likelihood of that 
> > > is very low. I think the current OpenVZ stuff already looks very 
> > 
> > We've been maintaining for some years now a C/R middleware with only a
> > few hooks in the kernel. Our strategy is to leverage existing kernel
> > paths as they do most of the work right.
> > 
> > Most of the checkpoint is performed from userspace, using regular
> > syscalls in a signal handler or /proc parsing. Restart is a bit trickier
> > and needs some kernel support to bypass syscall checks and enforce a
> > specific id for a resource. At the end, we support C/R and live
> > migration of networking apps (websphere application server for example).
> > 
> > > From our experience, we can tell:
> > 
> > Pros: mostly not-so-tricky userland code, independent from kernel
> > internals
> > Cons: sub-optimal for some resources
> 
> How do you restore struct task_struct::did_exec ?
With sys_execve().
-- 
Gregory Kurz                                     gkurz-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org
Software Engineer @ IBM/Meiosys                  http://www.ibm.com
Tel +33 (0)534 638 479                           Fax +33 (0)561 400 420
"Anarchy is about taking complete responsibility for yourself."
        Alan Moore.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                           ` <20090227090323.GC16211-X9Un+BFzKDI@public.gmane.org>
@ 2009-02-27  9:22                             ` Andrew Morton
  2009-02-27 10:59                               ` Alexey Dobriyan
  0 siblings, 1 reply; 121+ messages in thread
From: Andrew Morton @ 2009-02-27  9:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	tglx-hfZtesqFncYOwBW4kG4KsQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Alexey Dobriyan,
	xemul-GEFAQzZX7r8dnm+yROfE0A
On Fri, 27 Feb 2009 10:03:23 +0100 Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:
> >  arch/x86/include/asm/unistd_32.h   |    2 
> >  arch/x86/kernel/syscall_table_32.S |    2 
> >  include/linux/Kbuild               |    1 
> >  include/linux/cr.h                 |   56 ++++++
> >  include/linux/ipc_namespace.h      |    3 
> >  include/linux/syscalls.h           |    5 
> >  init/Kconfig                       |    2 
> >  kernel/Makefile                    |    1 
> >  kernel/cr/Kconfig                  |   11 +
> >  kernel/cr/Makefile                 |    8 
> >  kernel/cr/cpt-cred.c               |  115 +++++++++++++
> >  kernel/cr/cpt-fs.c                 |  122 +++++++++++++
> >  kernel/cr/cpt-mm.c                 |  134 +++++++++++++++
> >  kernel/cr/cpt-ns.c                 |  324 +++++++++++++++++++++++++++++++++++++
> >  kernel/cr/cpt-signal.c             |  121 +++++++++++++
> >  kernel/cr/cpt-sys.c                |  228 ++++++++++++++++++++++++++
> >  kernel/cr/cr-ctx.c                 |  141 ++++++++++++++++
> >  kernel/cr/cr.h                     |   61 ++++++
> >  kernel/cr/rst-sys.c                |    9 +
> >  kernel/sys_ni.c                    |    3 
> >  20 files changed, 1349 insertions(+)
> 
> That does not look scary to me at all. Andrew?
btw, why is there no need for a kernel/cr/cpt-net.c?
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-26 22:17                         ` Alexey Dobriyan
       [not found]                           ` <20090226221709.GA2924-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
@ 2009-02-27  9:36                           ` Cedric Le Goater
  1 sibling, 0 replies; 121+ messages in thread
From: Cedric Le Goater @ 2009-02-27  9:36 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Greg Kurz, linux-api, containers, mpm, linux-kernel, Dave Hansen,
	linux-mm, tglx, viro, hpa, Ingo Molnar, torvalds, Andrew Morton,
	xemul
Alexey Dobriyan wrote:
> On Thu, Feb 26, 2009 at 07:30:16PM +0100, Greg Kurz wrote:
>> On Thu, 2009-02-26 at 18:33 +0100, Ingo Molnar wrote:
>>> I think the main question is: will we ever find ourselves in the 
>>> future saying that "C/R sucks, nobody but a small minority uses 
>>> it, wish we had never merged it"? I think the likelihood of that 
>>> is very low. I think the current OpenVZ stuff already looks very 
>> We've been maintaining for some years now a C/R middleware with only a
>> few hooks in the kernel. Our strategy is to leverage existing kernel
>> paths as they do most of the work right.
>>
>> Most of the checkpoint is performed from userspace, using regular
>> syscalls in a signal handler or /proc parsing. Restart is a bit trickier
>> and needs some kernel support to bypass syscall checks and enforce a
>> specific id for a resource. At the end, we support C/R and live
>> migration of networking apps (websphere application server for example).
>>
>> From our experience, we can tell:
>>
>> Pros: mostly not-so-tricky userland code, independent from kernel
>> internals
>> Cons: sub-optimal for some resources
> 
> How do you restore struct task_struct::did_exec ?
Greg didn't say there was _no_ kernel support.
Without discussing the pros and cons of this or that implementation, full 
C/R from the kernel means more maintenance work for kernel maintainers, so
it seems a good idea to leverage existing APIs where they exist. Less work.
Duplicating the get/set of the CPU state, which is already done in
signal handling, is one example of extra work.
Now, there's definitely a need for kernel support for some resources. The 
question now is finding the right path; this is still work in progress IMHO.
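
To make the signal-handling point concrete: when a handler is installed with SA_SIGINFO, the kernel has already saved the interrupted CPU state and passes it to the handler as a `ucontext_t`. The sketch below only demonstrates that this state reaches userspace. It is not how any of the C/R implementations discussed here actually work. `on_checkpoint_signal()`, `capture_context_via_signal()` and the choice of SIGUSR1 are assumptions made for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <ucontext.h>

static volatile sig_atomic_t got_context;
static mcontext_t saved_mcontext;	/* register state as saved by the kernel */

/* Hypothetical "checkpoint" handler: copy out the machine context. */
static void on_checkpoint_signal(int sig, siginfo_t *info, void *ucv)
{
	ucontext_t *uc = ucv;

	(void)sig;
	(void)info;
	if (uc) {
		/* The layout of mcontext_t is architecture-specific. */
		saved_mcontext = uc->uc_mcontext;
		got_context = 1;
	}
}

/* Returns 0 if the handler ran and captured a context, -1 otherwise. */
static int capture_context_via_signal(void)
{
	struct sigaction sa, old;
	int rc;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_checkpoint_signal;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);

	got_context = 0;
	if (sigaction(SIGUSR1, &sa, &old) < 0)
		return -1;
	/* In a single-threaded caller the handler runs before raise() returns. */
	raise(SIGUSR1);
	rc = sigaction(SIGUSR1, &old, NULL);
	assert(rc == 0);
	(void)rc;
	return got_context ? 0 : -1;
}
```

Interpreting the saved registers requires per-architecture code, which is the kind of work the kernel's signal delivery path already does.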
C.
 
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-27  9:19                             ` Greg Kurz
@ 2009-02-27 10:53                               ` Alexey Dobriyan
  2009-02-27 14:33                                 ` Cedric Le Goater
  0 siblings, 1 reply; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-27 10:53 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Ingo Molnar, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
On Fri, Feb 27, 2009 at 10:19:09AM +0100, Greg Kurz wrote:
> On Fri, 2009-02-27 at 01:17 +0300, Alexey Dobriyan wrote:
> > On Thu, Feb 26, 2009 at 07:30:16PM +0100, Greg Kurz wrote:
> > > On Thu, 2009-02-26 at 18:33 +0100, Ingo Molnar wrote:
> > > > I think the main question is: will we ever find ourselves in the 
> > > > future saying that "C/R sucks, nobody but a small minority uses 
> > > > it, wish we had never merged it"? I think the likelihood of that 
> > > > is very low. I think the current OpenVZ stuff already looks very 
> > > 
> > > We've been maintaining for some years now a C/R middleware with only a
> > > few hooks in the kernel. Our strategy is to leverage existing kernel
> > > paths as they do most of the work right.
> > > 
> > > Most of the checkpoint is performed from userspace, using regular
> > > syscalls in a signal handler or /proc parsing. Restart is a bit trickier
> > > and needs some kernel support to bypass syscall checks and enforce a
> > > specific id for a resource. At the end, we support C/R and live
> > > migration of networking apps (websphere application server for example).
> > > 
> > > > From our experience, we can tell:
> > > 
> > > Pros: mostly not-so-tricky userland code, independent from kernel
> > > internals
> > > Cons: sub-optimal for some resources
> > 
> > How do you restore struct task_struct::did_exec ?
> 
> With sys_execve().
How do you restore the set of uts_namespaces? The kernel never exposes to
userspace which ones are shared and which are independent.
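
A small illustration of the distinction being asked about, as a userspace model only: code that can see object pointers can tell two tasks sharing one namespace from two tasks in separate namespaces that happen to have equal contents, while an observer limited to the contents cannot. `struct uts_ns`, `assign_ns_ids()`, `same_contents()` and `MAX_NS` are hypothetical names for this sketch, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct uts_namespace: only the nodename matters here. */
struct uts_ns {
	char nodename[65];
};

#define MAX_NS 16

/*
 * For each task's namespace pointer, store the index of the distinct object
 * it refers to, in first-seen order. Returns the number of distinct objects,
 * or -1 if more than MAX_NS are found.
 */
static int assign_ns_ids(struct uts_ns *const *task_ns, size_t ntasks, int *ids)
{
	const struct uts_ns *seen[MAX_NS];
	int nseen = 0, j;
	size_t i;

	assert(task_ns && ids);
	for (i = 0; i < ntasks; i++) {
		/* Identity is decided by pointer, never by contents. */
		for (j = 0; j < nseen; j++)
			if (seen[j] == task_ns[i])
				break;
		if (j == nseen) {
			if (nseen == MAX_NS)
				return -1;
			seen[nseen++] = task_ns[i];
		}
		ids[i] = j;
	}
	return nseen;
}

/* What a contents-only observer can compare. */
static int same_contents(const struct uts_ns *a, const struct uts_ns *b)
{
	return strcmp(a->nodename, b->nodename) == 0;
}
```

Two tasks pointing at the same object get the same id. A third task whose namespace is a separate object with an identical nodename gets a new id, even though `same_contents()` reports the two as equal.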
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-27  9:19                           ` Andrew Morton
@ 2009-02-27 10:57                             ` Alexey Dobriyan
  0 siblings, 0 replies; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-27 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Dave Hansen, mpm, containers, hpa, linux-kernel,
	linux-mm, viro, linux-api, torvalds, tglx, xemul
On Fri, Feb 27, 2009 at 01:19:01AM -0800, Andrew Morton wrote:
> On Fri, 27 Feb 2009 10:03:23 +0100 Ingo Molnar <mingo@elte.hu> wrote:
> 
> > 
> > * Alexey Dobriyan <adobriyan@gmail.com> wrote:
> > 
> > > > I think the main question is: will we ever find ourselves in 
> > > > the future saying that "C/R sucks, nobody but a small 
> > > > minority uses it, wish we had never merged it"? I think the 
> > > > likelyhood of that is very low. I think the current OpenVZ 
> > > > stuff already looks very useful, and i dont think we've 
> > > > realized (let alone explored) all the possibilities yet.
> > > 
> > > This is collecting and start of dumping part of cleaned up 
> > > OpenVZ C/R implementation, FYI.
> > > 
> > >  arch/x86/include/asm/unistd_32.h   |    2 
> > >  arch/x86/kernel/syscall_table_32.S |    2 
> > >  include/linux/Kbuild               |    1 
> > >  include/linux/cr.h                 |   56 ++++++
> > >  include/linux/ipc_namespace.h      |    3 
> > >  include/linux/syscalls.h           |    5 
> > >  init/Kconfig                       |    2 
> > >  kernel/Makefile                    |    1 
> > >  kernel/cr/Kconfig                  |   11 +
> > >  kernel/cr/Makefile                 |    8 
> > >  kernel/cr/cpt-cred.c               |  115 +++++++++++++
> > >  kernel/cr/cpt-fs.c                 |  122 +++++++++++++
> > >  kernel/cr/cpt-mm.c                 |  134 +++++++++++++++
> > >  kernel/cr/cpt-ns.c                 |  324 +++++++++++++++++++++++++++++++++++++
> > >  kernel/cr/cpt-signal.c             |  121 +++++++++++++
> > >  kernel/cr/cpt-sys.c                |  228 ++++++++++++++++++++++++++
> > >  kernel/cr/cr-ctx.c                 |  141 ++++++++++++++++
> > >  kernel/cr/cr.h                     |   61 ++++++
> > >  kernel/cr/rst-sys.c                |    9 +
> > >  kernel/sys_ni.c                    |    3 
> > >  20 files changed, 1349 insertions(+)
> > 
> > That does not look scary to me at all. Andrew?
> 
> I think we'd need to look into the details.  Sure, it's isolated from a
> where-it-is-in-the-tree POV.  But I assume that each of those files has
> intimate and intrusive knowledge of the internals of data structures?
Yes, and this is by design.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-27  9:22                             ` Andrew Morton
@ 2009-02-27 10:59                               ` Alexey Dobriyan
  0 siblings, 0 replies; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-27 10:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ingo Molnar, Dave Hansen, mpm, containers, hpa, linux-kernel,
	linux-mm, viro, linux-api, torvalds, tglx, xemul
On Fri, Feb 27, 2009 at 01:22:09AM -0800, Andrew Morton wrote:
> On Fri, 27 Feb 2009 10:03:23 +0100 Ingo Molnar <mingo@elte.hu> wrote:
> 
> > >  arch/x86/include/asm/unistd_32.h   |    2 
> > >  arch/x86/kernel/syscall_table_32.S |    2 
> > >  include/linux/Kbuild               |    1 
> > >  include/linux/cr.h                 |   56 ++++++
> > >  include/linux/ipc_namespace.h      |    3 
> > >  include/linux/syscalls.h           |    5 
> > >  init/Kconfig                       |    2 
> > >  kernel/Makefile                    |    1 
> > >  kernel/cr/Kconfig                  |   11 +
> > >  kernel/cr/Makefile                 |    8 
> > >  kernel/cr/cpt-cred.c               |  115 +++++++++++++
> > >  kernel/cr/cpt-fs.c                 |  122 +++++++++++++
> > >  kernel/cr/cpt-mm.c                 |  134 +++++++++++++++
> > >  kernel/cr/cpt-ns.c                 |  324 +++++++++++++++++++++++++++++++++++++
> > >  kernel/cr/cpt-signal.c             |  121 +++++++++++++
> > >  kernel/cr/cpt-sys.c                |  228 ++++++++++++++++++++++++++
> > >  kernel/cr/cr-ctx.c                 |  141 ++++++++++++++++
> > >  kernel/cr/cr.h                     |   61 ++++++
> > >  kernel/cr/rst-sys.c                |    9 +
> > >  kernel/sys_ni.c                    |    3 
> > >  20 files changed, 1349 insertions(+)
> > 
> > That does not look scary to me at all. Andrew?
> 
> btw, why is there no need for a kernel/cr/cpt-net.c?
Too early :-) There are no rst-*.c counterparts either :-) But I'm
working on this.
But yes, cpt-net.c will definitely be there: all sorts of sockets, virtual
netdevices, iptables, ...
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-27 10:53                               ` Alexey Dobriyan
@ 2009-02-27 14:33                                 ` Cedric Le Goater
  0 siblings, 0 replies; 121+ messages in thread
From: Cedric Le Goater @ 2009-02-27 14:33 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Greg Kurz, linux-api, containers, mpm, linux-kernel, Dave Hansen,
	linux-mm, tglx, viro, hpa, Ingo Molnar, torvalds, Andrew Morton,
	xemul
> How do you restore set of uts_namespace's?
	clone(CLONE_NEWUTS);
	sethostname(...)
> Kernel never exposes to userspace which are the same, which are independent.
 
I think you are addressing the problem from a kernel POV. If you see it
from the user POV, what the user cares about is what gethostname() returns,
not 'struct uts_namespace'.
That doesn't mean that C/R shouldn't be aware of the kernel implementation,
but if you think in terms of the user API, it makes life easier.
Cheers,
C.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-26 22:31                       ` Alexey Dobriyan
  2009-02-27  9:03                         ` Ingo Molnar
@ 2009-02-27 16:14                         ` Dave Hansen
  2009-02-27 21:57                           ` Alexey Dobriyan
       [not found]                         ` <20090226223112.GA2939-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
  2 siblings, 1 reply; 121+ messages in thread
From: Dave Hansen @ 2009-02-27 16:14 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Ingo Molnar, linux-api, containers, hpa, linux-kernel, linux-mm,
	viro, mpm, Andrew Morton, torvalds, tglx, xemul
On Fri, 2009-02-27 at 01:31 +0300, Alexey Dobriyan wrote:
> > I think the main question is: will we ever find ourselves in the 
> > future saying that "C/R sucks, nobody but a small minority uses 
> > it, wish we had never merged it"? I think the likelyhood of that 
> > is very low. I think the current OpenVZ stuff already looks very 
> > useful, and i dont think we've realized (let alone explored) all 
> > the possibilities yet.
> 
> This is collecting and start of dumping part of cleaned up OpenVZ C/R
> implementation, FYI.
Are you just posting this to show how you expect c/r to look eventually?
Or are you proposing this as an alternative to what Oren has been
posting?
-- Dave
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                             ` <20090227215749.GA3453-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
@ 2009-02-27 21:54                               ` Dave Hansen
  0 siblings, 0 replies; 121+ messages in thread
From: Dave Hansen @ 2009-02-27 21:54 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Ingo Molnar, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
On Sat, 2009-02-28 at 00:57 +0300, Alexey Dobriyan wrote:
> On Fri, Feb 27, 2009 at 08:14:58AM -0800, Dave Hansen wrote:
> > On Fri, 2009-02-27 at 01:31 +0300, Alexey Dobriyan wrote:
> > > > I think the main question is: will we ever find ourselves in the 
> > > > future saying that "C/R sucks, nobody but a small minority uses 
> > > > it, wish we had never merged it"? I think the likelyhood of that 
> > > > is very low. I think the current OpenVZ stuff already looks very 
> > > > useful, and i dont think we've realized (let alone explored) all 
> > > > the possibilities yet.
> > > 
> > > This is collecting and start of dumping part of cleaned up OpenVZ C/R
> > > implementation, FYI.
> > 
> > Are you just posting this to show how you expect c/r to look eventually?
> > Or are you proposing this as an alternative to what Oren has bee
> > posting?
> 
> This is under discussion right now.
Here as in LKML and containers@?  Or do you mean among the
OpenVZ/Virtuozzo folks?
The reason I ask is that we have gone through several rounds of
community review over the last few months with Oren's code, and I'd hate
to throw that away unless there's something wrong with it.  Is there
something wrong with it?
-- Dave
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-27 16:14                         ` Dave Hansen
@ 2009-02-27 21:57                           ` Alexey Dobriyan
       [not found]                             ` <20090227215749.GA3453-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
  0 siblings, 1 reply; 121+ messages in thread
From: Alexey Dobriyan @ 2009-02-27 21:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, linux-api, containers, hpa, linux-kernel, linux-mm,
	viro, mpm, Andrew Morton, torvalds, tglx, xemul
On Fri, Feb 27, 2009 at 08:14:58AM -0800, Dave Hansen wrote:
> On Fri, 2009-02-27 at 01:31 +0300, Alexey Dobriyan wrote:
> > > I think the main question is: will we ever find ourselves in the 
> > > future saying that "C/R sucks, nobody but a small minority uses 
> > > it, wish we had never merged it"? I think the likelyhood of that 
> > > is very low. I think the current OpenVZ stuff already looks very 
> > > useful, and i dont think we've realized (let alone explored) all 
> > > the possibilities yet.
> > 
> > This is collecting and start of dumping part of cleaned up OpenVZ C/R
> > implementation, FYI.
> 
> Are you just posting this to show how you expect c/r to look eventually?
> Or are you proposing this as an alternative to what Oren has bee
> posting?
This is under discussion right now.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                         ` <20090226223112.GA2939-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
@ 2009-03-01  1:33                           ` Alexey Dobriyan
       [not found]                             ` <20090301013304.GA2428-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
  0 siblings, 1 reply; 121+ messages in thread
From: Alexey Dobriyan @ 2009-03-01  1:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Dave Hansen, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
On Fri, Feb 27, 2009 at 01:31:12AM +0300, Alexey Dobriyan wrote:
> This is collecting and start of dumping part of cleaned up OpenVZ C/R
> implementation, FYI.
OK, here is the second version, which shows what to do with shared objects
(cr_dump_nsproxy(), cr_dump_task_struct()), introduces more checks
(still no unlinked files) and dumps some more information, including
the connections between structures (cr_pos_*).
Dumping pids is still being thought through, because in OpenVZ pids are saved
as numbers since CLONE_NEWPID is not allowed inside a container. In the presence
of multiple CLONE_NEWPID levels this must present a big problem. It looks
like there is no way to avoid dumping pids as separate objects.
As a result, struct cr_image_pid is variable-sized; I don't know how this will
play out later.
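A minimal userspace sketch of what "variable-sized" means here, assuming the
struct cr_image_pid layout from include/linux/cr.h in this patch. The helpers
cr_image_pid_len() and cr_init_image_pid() are hypothetical names used only for
illustration and are not part of the patch; uint32_t stands in for __u32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CR_OBJ_PID	15

/* Userspace mirror of the object header from include/linux/cr.h. */
struct cr_object_header {
	uint32_t cr_type;	/* object type */
	uint32_t cr_len;	/* object length in bytes including header */
};

/* Userspace mirror of struct cr_image_pid; cr_nr[] really holds cr_level + 1 entries. */
struct cr_image_pid {
	struct cr_object_header cr_hdr;
	uint32_t cr_level;
	uint32_t cr_nr[1];
};

/* Hypothetical helper: image size for a pid visible in (level + 1) pid namespaces. */
static size_t cr_image_pid_len(uint32_t level)
{
	return offsetof(struct cr_image_pid, cr_nr) +
	       ((size_t)level + 1) * sizeof(uint32_t);
}

/* Hypothetical helper: stamp the fixed part of a pid image. */
static void cr_init_image_pid(struct cr_image_pid *i, uint32_t level)
{
	size_t len = cr_image_pid_len(level);

	assert(len <= UINT32_MAX);	/* cr_len is only 32 bits wide */
	i->cr_hdr.cr_type = CR_OBJ_PID;
	i->cr_hdr.cr_len = (uint32_t)len;
	i->cr_level = level;
}
```

With no nesting (cr_level = 0) the image is exactly sizeof(struct cr_image_pid);
each extra CLONE_NEWPID level adds one more 32-bit number, so a reader has to
rely on cr_len from the object header rather than on a fixed size.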
Also, the pid refcount check for external pointers is busted right now,
because a /proc inode pins struct pid, so there is almost always a refcount
vs ->o_count mismatch.
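The collect-then-compare idea behind the external reference check can be
sketched in userspace as follows. This is an illustration only: struct
fake_cred, collect() and count_external_refs() are hypothetical stand-ins, a
fixed array replaces the kernel lists, and it assumes a newly collected object
starts with a count of 1.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OBJS 16

/* One entry per distinct object seen during the collect pass. */
struct cr_object {
	const void *o_obj;	/* the shared object itself */
	unsigned long o_count;	/* references found from inside the container */
};

struct cr_list {
	struct cr_object objs[MAX_OBJS];
	size_t nr;
};

/* Stand-in for a refcounted kernel object such as struct cred. */
struct fake_cred {
	unsigned int usage;	/* total references, internal and external */
};

/* Collect one reference: bump o_count if already known, else add a new entry. */
static int collect(struct cr_list *l, const void *ptr)
{
	size_t n;

	assert(ptr != NULL);
	for (n = 0; n < l->nr; n++) {
		if (l->objs[n].o_obj == ptr) {
			l->objs[n].o_count++;
			return 0;
		}
	}
	if (l->nr == MAX_OBJS)
		return -1;
	l->objs[l->nr].o_obj = ptr;
	l->objs[l->nr].o_count = 1;
	l->nr++;
	return 0;
}

/* Count objects whose real refcount differs from the collected count. */
static size_t count_external_refs(const struct cr_list *l)
{
	size_t n, mismatches = 0;

	for (n = 0; n < l->nr; n++) {
		const struct fake_cred *c = l->objs[n].o_obj;

		if (l->objs[n].o_count != c->usage)
			mismatches++;
	}
	return mismatches;
}
```

An object pinned from outside the checkpointed set, the way a /proc inode pins
struct pid, shows up as a mismatch even though nothing is wrong, which is why
the pid check misfires.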
No restore yet. ;-)
 arch/x86/include/asm/unistd_32.h   |    2 
 arch/x86/kernel/syscall_table_32.S |    2 
 include/linux/Kbuild               |    1 
 include/linux/cr.h                 |  169 +++++++++++++
 include/linux/ipc_namespace.h      |    3 
 include/linux/syscalls.h           |    5 
 init/Kconfig                       |    2 
 kernel/Makefile                    |    1 
 kernel/cr/Kconfig                  |   11 
 kernel/cr/Makefile                 |    9 
 kernel/cr/cpt-cred.c               |  114 +++++++++
 kernel/cr/cpt-fs.c                 |  248 ++++++++++++++++++++
 kernel/cr/cpt-mm.c                 |  152 ++++++++++++
 kernel/cr/cpt-ns.c                 |  451 +++++++++++++++++++++++++++++++++++++
 kernel/cr/cpt-signal.c             |  166 +++++++++++++
 kernel/cr/cpt-sys.c                |  258 +++++++++++++++++++++
 kernel/cr/cpt-task.c               |  176 ++++++++++++++
 kernel/cr/cr-ctx.c                 |  102 ++++++++
 kernel/cr/cr.h                     |  104 ++++++++
 kernel/cr/rst-sys.c                |    9 
 kernel/sys_ni.c                    |    3 
 21 files changed, 1988 insertions(+)
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..9504ede 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,8 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_checkpoint		333
+#define __NR_restart		334
 
 #ifdef __KERNEL__
 
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index e2e86a0..9f8c398 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index b97cdc5..113d257 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -50,6 +50,7 @@ header-y += coff.h
 header-y += comstats.h
 header-y += const.h
 header-y += cgroupstats.h
+header-y += cr.h
 header-y += cramfs_fs.h
 header-y += cycx_cfm.h
 header-y += dlmconstants.h
diff --git a/include/linux/cr.h b/include/linux/cr.h
new file mode 100644
index 0000000..a761e4c
--- /dev/null
+++ b/include/linux/cr.h
@@ -0,0 +1,169 @@
+#ifndef __INCLUDE_LINUX_CR_H
+#define __INCLUDE_LINUX_CR_H
+
+#include <linux/types.h>
+
+#define CR_POS_UNDEF	(~0ULL)
+
+struct cr_header {
+	/* Immutable part except version bumps. */
+#define CR_HEADER_MAGIC	"LinuxC/R"
+	__u8	cr_signature[8];
+#define CR_IMAGE_VERSION	1
+	__le64	cr_image_version;
+
+	/* Mutable part. */
+	__u8	cr_uts_release[64];	/* Give distro kernels a chance. */
+#define CR_ARCH_X86_32	1
+	__le32	cr_arch;
+};
+
+struct cr_object_header {
+#define CR_OBJ_TASK_STRUCT	1
+#define CR_OBJ_NSPROXY		2
+#define CR_OBJ_UTS_NS		3
+#define CR_OBJ_IPC_NS		4
+#define CR_OBJ_MNT_NS		5
+#define CR_OBJ_PID_NS		6
+#define CR_OBJ_NET_NS		7
+#define CR_OBJ_MM_STRUCT	8
+#define CR_OBJ_SIGNAL_STRUCT	9
+#define CR_OBJ_SIGHAND_STRUCT	10
+#define CR_OBJ_FS_STRUCT	11
+#define CR_OBJ_FILES_STRUCT	12
+#define CR_OBJ_FILE		13
+#define CR_OBJ_CRED		14
+#define CR_OBJ_PID		15
+	__u32	cr_type;	/* object type */
+	__u32	cr_len;		/* object length in bytes including header */
+};
+
+/*
+ * 1. struct cr_object_header MUST start object's image.
+ * 2. Every member SHOULD start with 'cr_' prefix.
+ * 3. Every member which refers to position of another object image in
+ *    a dumpfile MUST be __u64 and SHOULD additionally use 'pos_' prefix.
+ * 4. Size and layout of every object type image MUST be the same on all
+ *    architectures.
+ */
+
+struct cr_image_task_struct {
+	struct cr_object_header cr_hdr;
+
+	__u64	cr_pos_real_cred;
+	__u64	cr_pos_cred;
+	__u8	cr_comm[16];
+	__u64	cr_pos_mm_struct;
+	__u64	cr_pos_pids[3];
+	__u64	cr_pos_fs;
+	__u64	cr_pos_files;
+	__u64	cr_pos_nsproxy;
+	__u64	cr_pos_signal;
+	__u64	cr_pos_sighand;
+};
+
+struct cr_image_nsproxy {
+	struct cr_object_header cr_hdr;
+
+	__u64	cr_pos_uts_ns;
+	__u64	cr_pos_ipc_ns;	/* CR_POS_UNDEF if CONFIG_SYSVIPC=n */
+	__u64	cr_pos_mnt_ns;
+	__u64	cr_pos_pid_ns;
+	__u64	cr_pos_net_ns;	/* CR_POS_UNDEF if CONFIG_NET=n */
+};
+
+struct cr_image_uts_ns {
+	struct cr_object_header cr_hdr;
+
+	__u8	cr_sysname[64];
+	__u8	cr_nodename[64];
+	__u8	cr_release[64];
+	__u8	cr_version[64];
+	__u8	cr_machine[64];
+	__u8	cr_domainname[64];
+};
+
+struct cr_image_ipc_ns {
+	struct cr_object_header cr_hdr;
+};
+
+struct cr_image_mnt_ns {
+	struct cr_object_header cr_hdr;
+};
+
+struct cr_image_pid_ns {
+	struct cr_object_header cr_hdr;
+
+	__u32	cr_level;
+	__u32	cr_last_pid;
+};
+
+struct cr_image_net_ns {
+	struct cr_object_header cr_hdr;
+};
+
+struct cr_image_mm_struct {
+	struct cr_object_header cr_hdr;
+};
+
+struct cr_image_signal_struct {
+	struct cr_object_header cr_hdr;
+
+	struct {
+		__u64	cr_rlim_cur;
+		__u64	cr_rlim_max;
+	} cr_rlim[16];
+};
+
+struct cr_image_sighand_struct {
+	struct cr_object_header cr_hdr;
+};
+
+struct cr_image_fs_struct {
+	struct cr_object_header cr_hdr;
+
+	__u32	cr_umask;
+};
+
+struct cr_image_files_struct {
+	struct cr_object_header cr_hdr;
+};
+
+struct cr_image_file {
+	struct cr_object_header cr_hdr;
+
+	__u32	cr_f_flags;
+	__u32	_;
+	__u64	cr_f_pos;
+	__u64	cr_pos_f_owner_pid;
+	__u32	cr_f_owner_pid_type;
+	__u32	cr_f_owner_uid;
+	__u32	cr_f_owner_euid;
+	__u32	cr_f_owner_signum;
+	__u64	cr_pos_f_cred;
+};
+
+struct cr_image_cred {
+	struct cr_object_header cr_hdr;
+
+	__u32	cr_uid;
+	__u32	cr_gid;
+	__u32	cr_suid;
+	__u32	cr_sgid;
+	__u32	cr_euid;
+	__u32	cr_egid;
+	__u32	cr_fsuid;
+	__u32	cr_fsgid;
+	__u64	cr_cap_inheritable;
+	__u64	cr_cap_permitted;
+	__u64	cr_cap_effective;
+	__u64	cr_cap_bset;
+};
+
+struct cr_image_pid {
+	struct cr_object_header cr_hdr;
+
+	__u32	cr_level;
+	__u32	cr_nr[1];	/* cr_nr[cr_level + 1] */
+};
+#endif
diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index ea330f9..87a8053 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -3,9 +3,12 @@
 
 #include <linux/err.h>
 #include <linux/idr.h>
+#include <linux/kref.h>
 #include <linux/rwsem.h>
 #include <linux/notifier.h>
 
+struct kern_ipc_perm;
+
 /*
  * ipc namespace events
  */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f9f900c..fac8fa9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -692,6 +692,11 @@ asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
 asmlinkage long sys_pipe2(int __user *, int);
 asmlinkage long sys_pipe(int __user *);
 
+#ifdef CONFIG_CR
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags);
+asmlinkage long sys_restart(int fd, unsigned long flags);
+#endif
+
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index f068071..1b69c64 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -540,6 +540,8 @@ config CGROUP_MEM_RES_CTLR_SWAP
 
 endif # CGROUPS
 
+source "kernel/cr/Kconfig"
+
 config MM_OWNER
 	bool
 
diff --git a/kernel/Makefile b/kernel/Makefile
index e4791b3..71f9c68 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -93,6 +93,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace/
 obj-$(CONFIG_TRACING) += trace/
 obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_CR) += cr/
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan-HpYC8cTCicIJY7gZg3T8ig@public.gmane.org>, the -fno-omit-frame-pointer is
diff --git a/kernel/cr/Kconfig b/kernel/cr/Kconfig
new file mode 100644
index 0000000..bebef29
--- /dev/null
+++ b/kernel/cr/Kconfig
@@ -0,0 +1,11 @@
+config CR
+	bool "Container checkpoint/restart"
+	depends on IPC_NS || (SYSVIPC = n)
+	depends on NET_NS || (NET = n)
+	depends on PID_NS
+	depends on USER_NS
+	depends on UTS_NS
+	select FREEZER
+	depends on X86_32
+	help
+	  Container checkpoint/restart
diff --git a/kernel/cr/Makefile b/kernel/cr/Makefile
new file mode 100644
index 0000000..1033425
--- /dev/null
+++ b/kernel/cr/Makefile
@@ -0,0 +1,9 @@
+obj-$(CONFIG_CR) += cr.o
+cr-y := cr-ctx.o
+cr-y += cpt-sys.o rst-sys.o
+cr-y += cpt-cred.o
+cr-y += cpt-fs.o
+cr-y += cpt-mm.o
+cr-y += cpt-ns.o
+cr-y += cpt-signal.o
+cr-y += cpt-task.o
diff --git a/kernel/cr/cpt-cred.c b/kernel/cr/cpt-cred.c
new file mode 100644
index 0000000..d071988
--- /dev/null
+++ b/kernel/cr/cpt-cred.c
@@ -0,0 +1,114 @@
+#include <linux/cr.h>
+#include <linux/cred.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include "cr.h"
+
+int cr_dump_cred(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct cred *cred = obj->o_obj;
+	struct cr_image_cred *i;
+
+	printk("%s: dump cred %p\n", __func__, cred);
+
+	i = cr_prepare_image(CR_OBJ_CRED, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	i->cr_uid = cred->uid;
+	i->cr_gid = cred->gid;
+	i->cr_suid = cred->suid;
+	i->cr_sgid = cred->sgid;
+	i->cr_euid = cred->euid;
+	i->cr_egid = cred->egid;
+	i->cr_fsuid = cred->fsuid;
+	i->cr_fsgid = cred->fsgid;
+	BUILD_BUG_ON(sizeof(cred->cap_inheritable) != 8);
+	memcpy(&i->cr_cap_inheritable, &cred->cap_inheritable, 8);
+	memcpy(&i->cr_cap_permitted, &cred->cap_permitted, 8);
+	memcpy(&i->cr_cap_effective, &cred->cap_effective, 8);
+	memcpy(&i->cr_cap_bset, &cred->cap_bset, 8);
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int cr_check_cred(struct cred *cred)
+{
+	if (cred->securebits)
+		return -EINVAL;
+#ifdef CONFIG_KEYS
+	if (cred->thread_keyring || cred->request_key_auth || cred->tgcred)
+		return -EINVAL;
+#endif
+#ifdef CONFIG_SECURITY
+	if (cred->security)
+		return -EINVAL;
+#endif
+	return 0;
+}
+
+static int __cr_collect_cred(struct cr_context *ctx, struct cred *cred)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_CRED) {
+		if (obj->o_obj == cred) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(cred);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_CRED]);
+	printk("%s: collect cred %p\n", __func__, cred);
+	return 0;
+}
+
+int cr_collect_cred(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+
+		rv = cr_check_cred((struct cred *)tsk->real_cred);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_cred(ctx, (struct cred *)tsk->real_cred);
+		if (rv < 0)
+			return rv;
+		rv = cr_check_cred((struct cred *)tsk->cred);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_cred(ctx, (struct cred *)tsk->cred);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_FILE) {
+		struct file *file = obj->o_obj;
+
+		rv = cr_check_cred((struct cred *)file->f_cred);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_cred(ctx, (struct cred *)file->f_cred);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_CRED) {
+		struct cred *cred = obj->o_obj;
+		unsigned int cnt = atomic_read(&cred->usage);
+
+		if (obj->o_count != cnt) {
+			printk("%s: cred %p has external references %lu:%u\n", __func__, cred, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
diff --git a/kernel/cr/cpt-fs.c b/kernel/cr/cpt-fs.c
new file mode 100644
index 0000000..b8ef0dd
--- /dev/null
+++ b/kernel/cr/cpt-fs.c
@@ -0,0 +1,248 @@
+#include <linux/cr.h>
+#include <linux/fdtable.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/sched.h>
+#include "cr.h"
+
+int cr_dump_file(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct file *file = obj->o_obj;
+	struct cr_object *tmp;
+	struct cr_image_file *i;
+
+	printk("%s: dump file %p\n", __func__, file);
+
+	i = cr_prepare_image(CR_OBJ_FILE, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	i->cr_f_flags = file->f_flags;
+	i->cr_f_pos = file->f_pos;
+	if (file->f_owner.pid) {
+		tmp = cr_find_obj_by_ptr(ctx, file->f_owner.pid, CR_CTX_PID);
+		i->cr_pos_f_owner_pid = tmp->o_pos;
+	} else
+		i->cr_pos_f_owner_pid = CR_POS_UNDEF;
+	i->cr_f_owner_pid_type = file->f_owner.pid_type;
+	i->cr_f_owner_uid = file->f_owner.uid;
+	i->cr_f_owner_euid = file->f_owner.euid;
+	i->cr_f_owner_signum = file->f_owner.signum;
+	tmp = cr_find_obj_by_ptr(ctx, file->f_cred, CR_CTX_CRED);
+	i->cr_pos_f_cred = tmp->o_pos;
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int cr_check_file(struct file *file)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+
+#ifdef CONFIG_SECURITY
+	if (file->f_security)
+		return -EINVAL;
+#endif
+	spin_lock(&file->f_ep_lock);
+	if (!list_empty(&file->f_ep_links)) {
+		spin_unlock(&file->f_ep_lock);
+		return -EINVAL;
+	}
+	spin_unlock(&file->f_ep_lock);
+
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+	case S_IFDIR:
+		/* Likely on-disk filesystem. */
+		/* FIXME: FUSE, NFS, other networking filesystems */
+		if (inode->i_sb->s_type->fs_flags & FS_REQUIRES_DEV)
+			return 0;
+		break;
+	case S_IFBLK:
+		break;
+	case S_IFCHR:
+		break;
+	case S_IFIFO:
+		break;
+	case S_IFSOCK:
+		break;
+	case S_IFLNK:
+		/* One can't open symlink. */
+		BUG();
+	}
+	printk("%s: can't checkpoint file %p, ->f_op = %pS\n", __func__, file, file->f_op);
+	return -EINVAL;
+}
+
+int __cr_collect_file(struct cr_context *ctx, struct file *file)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_FILE) {
+		if (obj->o_obj == file) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(file);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_FILE]);
+	printk("%s: collect file %p\n", __func__, file);
+	return 0;
+}
+
+int cr_dump_files_struct(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct files_struct *files = obj->o_obj;
+	struct cr_image_files_struct *i;
+
+	printk("%s: dump files_struct %p\n", __func__, files);
+
+	i = cr_prepare_image(CR_OBJ_FILES_STRUCT, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int __cr_collect_files_struct(struct cr_context *ctx, struct files_struct *files)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_FILES_STRUCT) {
+		if (obj->o_obj == files) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(files);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_FILES_STRUCT]);
+	printk("%s: collect files_struct %p\n", __func__, files);
+	return 0;
+}
+
+int cr_collect_files_struct(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+
+		rv = __cr_collect_files_struct(ctx, tsk->files);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_FILES_STRUCT) {
+		struct files_struct *files = obj->o_obj;
+		unsigned int cnt = atomic_read(&files->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: files_struct %p has external references %lu:%u\n", __func__, files, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_FILES_STRUCT) {
+		struct files_struct *files = obj->o_obj;
+		int fd;
+
+		for (fd = 0; fd < files_fdtable(files)->max_fds; fd++) {
+			struct file *file;
+
+			file = fcheck_files(files, fd);
+			if (file) {
+				rv = cr_check_file(file);
+				if (rv < 0)
+					return rv;
+				rv = __cr_collect_file(ctx, file);
+				if (rv < 0)
+					return rv;
+			}
+		}
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_FILE) {
+		struct file *file = obj->o_obj;
+		unsigned long cnt = atomic_long_read(&file->f_count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: file %p/%pS has external references %lu:%lu\n", __func__, file, file->f_op, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+int cr_dump_fs_struct(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct fs_struct *fs = obj->o_obj;
+	struct cr_image_fs_struct *i;
+
+	printk("%s: dump fs_struct %p\n", __func__, fs);
+
+	i = cr_prepare_image(CR_OBJ_FS_STRUCT, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	i->cr_umask = fs->umask;
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int __cr_collect_fs_struct(struct cr_context *ctx, struct fs_struct *fs)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_FS_STRUCT) {
+		if (obj->o_obj == fs) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(fs);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_FS_STRUCT]);
+	printk("%s: collect fs_struct %p\n", __func__, fs);
+	return 0;
+}
+
+int cr_collect_fs_struct(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+
+		rv = __cr_collect_fs_struct(ctx, tsk->fs);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_FS_STRUCT) {
+		struct fs_struct *fs = obj->o_obj;
+		unsigned int cnt = atomic_read(&fs->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: fs_struct %p has external references %lu:%u\n", __func__, fs, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
diff --git a/kernel/cr/cpt-mm.c b/kernel/cr/cpt-mm.c
new file mode 100644
index 0000000..830f180
--- /dev/null
+++ b/kernel/cr/cpt-mm.c
@@ -0,0 +1,152 @@
+#include <linux/cr.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/sched.h>
+#include "cr.h"
+
+static int cr_check_vma(struct vm_area_struct *vma)
+{
+	unsigned long flags = vma->vm_flags;
+
+	/* Flags, we know and love. */
+	flags &= ~VM_READ;
+	flags &= ~VM_WRITE;
+	flags &= ~VM_EXEC;
+	flags &= ~VM_MAYREAD;
+	flags &= ~VM_MAYWRITE;
+	flags &= ~VM_MAYEXEC;
+	flags &= ~VM_GROWSDOWN;
+	flags &= ~VM_DENYWRITE;
+	flags &= ~VM_EXECUTABLE;
+	flags &= ~VM_DONTEXPAND;
+	flags &= ~VM_ACCOUNT;
+	flags &= ~VM_ALWAYSDUMP;
+	flags &= ~VM_CAN_NONLINEAR;
+	/* Flags, we don't know and don't love. */
+	if (flags) {
+		printk("%s: vma = %p, unknown ->vm_flags 0x%lx\n", __func__, vma, flags);
+		return -EINVAL;
+	}
+	return 0;
+}
+
+int cr_dump_mm_struct(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct mm_struct *mm = obj->o_obj;
+	struct cr_image_mm_struct *i;
+
+	printk("%s: dump mm_struct %p\n", __func__, mm);
+
+	i = cr_prepare_image(CR_OBJ_MM_STRUCT, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int cr_check_mm(struct mm_struct *mm, struct task_struct *tsk)
+{
+	if (!mm)
+		return -EINVAL;
+	down_read(&mm->mmap_sem);
+	if (mm->core_state) {
+		up_read(&mm->mmap_sem);
+		return -EINVAL;
+	}
+	up_read(&mm->mmap_sem);
+#ifdef CONFIG_AIO
+	spin_lock(&mm->ioctx_lock);
+	if (!hlist_empty(&mm->ioctx_list)) {
+		spin_unlock(&mm->ioctx_lock);
+		return -EINVAL;
+	}
+	spin_unlock(&mm->ioctx_lock);
+#endif
+#ifdef CONFIG_MM_OWNER
+	if (mm->owner != tsk)
+		return -EINVAL;
+#endif
+#ifdef CONFIG_MMU_NOTIFIER
+	down_read(&mm->mmap_sem);
+	if (mm_has_notifiers(mm)) {
+		up_read(&mm->mmap_sem);
+		return -EINVAL;
+	}
+	up_read(&mm->mmap_sem);
+#endif
+	return 0;
+}
+
+static int __cr_collect_mm(struct cr_context *ctx, struct mm_struct *mm)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_MM_STRUCT) {
+		if (obj->o_obj == mm) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(mm);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_MM_STRUCT]);
+	printk("%s: collect mm_struct %p\n", __func__, mm);
+	return 0;
+}
+
+int cr_collect_mm(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+		struct mm_struct *mm = tsk->mm;
+
+		rv = cr_check_mm(mm, tsk);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_mm(ctx, mm);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_MM_STRUCT) {
+		struct mm_struct *mm = obj->o_obj;
+		unsigned int cnt = atomic_read(&mm->mm_users);
+
+		if (obj->o_count != cnt) {
+			printk("%s: mm_struct %p has external references %lu:%u\n", __func__, mm, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_MM_STRUCT) {
+		struct mm_struct *mm = obj->o_obj;
+		struct vm_area_struct *vma;
+
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			rv = cr_check_vma(vma);
+			if (rv < 0)
+				return rv;
+			if (vma->vm_file) {
+				rv = __cr_collect_file(ctx, vma->vm_file);
+				if (rv < 0)
+					return rv;
+			}
+		}
+#ifdef CONFIG_PROC_FS
+		if (mm->exe_file) {
+			rv = __cr_collect_file(ctx, mm->exe_file);
+			if (rv < 0)
+				return rv;
+		}
+#endif
+	}
+	return 0;
+}
diff --git a/kernel/cr/cpt-ns.c b/kernel/cr/cpt-ns.c
new file mode 100644
index 0000000..07dd1f4
--- /dev/null
+++ b/kernel/cr/cpt-ns.c
@@ -0,0 +1,451 @@
+#include <linux/cr.h>
+#include <linux/fs.h>
+#include <linux/ipc_namespace.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/mnt_namespace.h>
+#include <linux/pid_namespace.h>
+#include <linux/utsname.h>
+#include <net/net_namespace.h>
+#include "cr.h"
+
+int cr_dump_uts_ns(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct uts_namespace *uts_ns = obj->o_obj;
+	struct cr_image_uts_ns *i;
+
+	printk("%s: dump uts_ns %p\n", __func__, uts_ns);
+
+	i = cr_prepare_image(CR_OBJ_UTS_NS, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	strncpy((char *)i->cr_sysname, (const char *)uts_ns->name.sysname, 64);
+	strncpy((char *)i->cr_nodename, (const char *)uts_ns->name.nodename, 64);
+	strncpy((char *)i->cr_release, (const char *)uts_ns->name.release, 64);
+	strncpy((char *)i->cr_version, (const char *)uts_ns->name.version, 64);
+	strncpy((char *)i->cr_machine, (const char *)uts_ns->name.machine, 64);
+	strncpy((char *)i->cr_domainname, (const char *)uts_ns->name.domainname, 64);
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int __cr_collect_uts_ns(struct cr_context *ctx, struct uts_namespace *uts_ns)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_UTS_NS) {
+		if (obj->o_obj == uts_ns) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(uts_ns);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_UTS_NS]);
+	printk("%s: collect uts_ns %p\n", __func__, uts_ns);
+	return 0;
+}
+
+static int cr_collect_uts_ns(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_NSPROXY) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = __cr_collect_uts_ns(ctx, nsproxy->uts_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_UTS_NS) {
+		struct uts_namespace *uts_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&uts_ns->kref.refcount);
+
+		if (obj->o_count != cnt) {
+			printk("%s: uts_ns %p has external references %lu:%u\n", __func__, uts_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+#ifdef CONFIG_SYSVIPC
+int cr_dump_ipc_ns(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct ipc_namespace *ipc_ns = obj->o_obj;
+	struct cr_image_ipc_ns *i;
+
+	printk("%s: dump ipc_ns %p\n", __func__, ipc_ns);
+
+	i = cr_prepare_image(CR_OBJ_IPC_NS, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int __cr_collect_ipc_ns(struct cr_context *ctx, struct ipc_namespace *ipc_ns)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_IPC_NS) {
+		if (obj->o_obj == ipc_ns) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(ipc_ns);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_IPC_NS]);
+	printk("%s: collect ipc_ns %p\n", __func__, ipc_ns);
+	return 0;
+}
+
+static int cr_collect_ipc_ns(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_NSPROXY) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = __cr_collect_ipc_ns(ctx, nsproxy->ipc_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_IPC_NS) {
+		struct ipc_namespace *ipc_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&ipc_ns->kref.refcount);
+
+		if (obj->o_count != cnt) {
+			printk("%s: ipc_ns %p has external references %lu:%u\n", __func__, ipc_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+#else
+static int cr_collect_ipc_ns(struct cr_context *ctx)
+{
+	return 0;
+}
+#endif
+
+int cr_dump_mnt_ns(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct mnt_namespace *mnt_ns = obj->o_obj;
+	struct cr_image_mnt_ns *i;
+
+	printk("%s: dump mnt_ns %p\n", __func__, mnt_ns);
+
+	i = cr_prepare_image(CR_OBJ_MNT_NS, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int __cr_collect_mnt_ns(struct cr_context *ctx, struct mnt_namespace *mnt_ns)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_MNT_NS) {
+		if (obj->o_obj == mnt_ns) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(mnt_ns);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_MNT_NS]);
+	printk("%s: collect mnt_ns %p\n", __func__, mnt_ns);
+	return 0;
+}
+
+static int cr_collect_mnt_ns(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_NSPROXY) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = __cr_collect_mnt_ns(ctx, nsproxy->mnt_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_MNT_NS) {
+		struct mnt_namespace *mnt_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&mnt_ns->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: mnt_ns %p has external references %lu:%u\n", __func__, mnt_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+int cr_dump_pid_ns(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct pid_namespace *pid_ns = obj->o_obj;
+	struct cr_image_pid_ns *i;
+
+	printk("%s: dump pid_ns %p\n", __func__, pid_ns);
+
+	i = cr_prepare_image(CR_OBJ_PID_NS, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	i->cr_level = pid_ns->level;
+	i->cr_last_pid = pid_ns->last_pid;
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int cr_check_pid_ns(struct pid_namespace *pid_ns)
+{
+#ifdef CONFIG_BSD_PROCESS_ACCT
+	if (pid_ns->bacct)
+		return -EINVAL;
+#endif
+	return 0;
+}
+
+static int __cr_collect_pid_ns(struct cr_context *ctx, struct pid_namespace *pid_ns)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_PID_NS) {
+		if (obj->o_obj == pid_ns) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(pid_ns);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_PID_NS]);
+	printk("%s: collect pid_ns %p\n", __func__, pid_ns);
+	return 0;
+}
+
+static int cr_collect_pid_ns(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_NSPROXY) {
+		struct nsproxy *nsproxy = obj->o_obj;
+		struct pid_namespace *pid_ns = nsproxy->pid_ns;
+
+		rv = cr_check_pid_ns(pid_ns);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_pid_ns(ctx, pid_ns);
+		if (rv < 0)
+			return rv;
+	}
+	/*
+	 * FIXME: check for external pid_ns references
+	 * 1. struct pid pins pid_ns
+	 * 2. struct pid_namespace pins pid_ns, but only parent one
+	 */
+	return 0;
+}
+
+#ifdef CONFIG_NET
+int cr_dump_net_ns(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct net *net = obj->o_obj;
+	struct cr_image_net_ns *i;
+
+	printk("%s: dump net_ns %p\n", __func__, net);
+
+	i = cr_prepare_image(CR_OBJ_NET_NS, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int __cr_collect_net_ns(struct cr_context *ctx, struct net *net_ns)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_NET_NS) {
+		if (obj->o_obj == net_ns) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(net_ns);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_NET_NS]);
+	printk("%s: collect net_ns %p\n", __func__, net_ns);
+	return 0;
+}
+
+static int cr_collect_net_ns(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_NSPROXY) {
+		struct nsproxy *nsproxy = obj->o_obj;
+
+		rv = __cr_collect_net_ns(ctx, nsproxy->net_ns);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_NET_NS) {
+		struct net *net_ns = obj->o_obj;
+		unsigned int cnt = atomic_read(&net_ns->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: net_ns %p has external references %lu:%u\n", __func__, net_ns, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+#else
+static int cr_collect_net_ns(struct cr_context *ctx)
+{
+	return 0;
+}
+#endif
+
+int cr_dump_nsproxy(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct nsproxy *nsproxy = obj->o_obj;
+	struct cr_object *tmp;
+	struct cr_image_nsproxy *i;
+
+	printk("%s: dump nsproxy %p\n", __func__, nsproxy);
+
+	i = cr_prepare_image(CR_OBJ_NSPROXY, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	tmp = cr_find_obj_by_ptr(ctx, nsproxy->uts_ns, CR_CTX_UTS_NS);
+	i->cr_pos_uts_ns = tmp->o_pos;
+#ifdef CONFIG_SYSVIPC
+	tmp = cr_find_obj_by_ptr(ctx, nsproxy->ipc_ns, CR_CTX_IPC_NS);
+	i->cr_pos_ipc_ns = tmp->o_pos;
+#else
+	i->cr_pos_ipc_ns = CR_POS_UNDEF;
+#endif
+	tmp = cr_find_obj_by_ptr(ctx, nsproxy->mnt_ns, CR_CTX_MNT_NS);
+	i->cr_pos_mnt_ns = tmp->o_pos;
+	tmp = cr_find_obj_by_ptr(ctx, nsproxy->pid_ns, CR_CTX_PID_NS);
+	i->cr_pos_pid_ns = tmp->o_pos;
+#ifdef CONFIG_NET
+	tmp = cr_find_obj_by_ptr(ctx, nsproxy->net_ns, CR_CTX_NET_NS);
+	i->cr_pos_net_ns = tmp->o_pos;
+#else
+	i->cr_pos_net_ns = CR_POS_UNDEF;
+#endif
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int __cr_collect_nsproxy(struct cr_context *ctx, struct nsproxy *nsproxy)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_NSPROXY) {
+		if (obj->o_obj == nsproxy) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(nsproxy);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_NSPROXY]);
+	printk("%s: collect nsproxy %p\n", __func__, nsproxy);
+	return 0;
+}
+
+int cr_collect_nsproxy(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+		struct nsproxy *nsproxy;
+
+		rcu_read_lock();
+		nsproxy = task_nsproxy(tsk);
+		rcu_read_unlock();
+		if (!nsproxy)
+			return -EAGAIN;
+
+		rv = __cr_collect_nsproxy(ctx, nsproxy);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_NSPROXY) {
+		struct nsproxy *nsproxy = obj->o_obj;
+		unsigned int cnt = atomic_read(&nsproxy->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: nsproxy %p has external references %lu:%u\n", __func__, nsproxy, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	rv = cr_collect_uts_ns(ctx);
+	if (rv < 0)
+		return rv;
+	rv = cr_collect_ipc_ns(ctx);
+	if (rv < 0)
+		return rv;
+	rv = cr_collect_mnt_ns(ctx);
+	if (rv < 0)
+		return rv;
+	rv = cr_collect_pid_ns(ctx);
+	if (rv < 0)
+		return rv;
+	rv = cr_collect_net_ns(ctx);
+	if (rv < 0)
+		return rv;
+	return 0;
+}
diff --git a/kernel/cr/cpt-signal.c b/kernel/cr/cpt-signal.c
new file mode 100644
index 0000000..32a5bd7
--- /dev/null
+++ b/kernel/cr/cpt-signal.c
@@ -0,0 +1,166 @@
+#include <linux/cr.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include "cr.h"
+
+int cr_dump_signal_struct(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct signal_struct *signal = obj->o_obj;
+	struct cr_image_signal_struct *i;
+	int n;
+
+	printk("%s: dump signal_struct %p\n", __func__, signal);
+
+	i = cr_prepare_image(CR_OBJ_SIGNAL_STRUCT, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	BUILD_BUG_ON(RLIM_NLIMITS != 16);
+	for (n = 0; n < RLIM_NLIMITS; n++) {
+		i->cr_rlim[n].cr_rlim_cur = signal->rlim[n].rlim_cur;
+		i->cr_rlim[n].cr_rlim_max = signal->rlim[n].rlim_max;
+	}
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int cr_check_signal(struct signal_struct *signal)
+{
+	if (!signal)
+		return -EINVAL;
+	if (!list_empty(&signal->posix_timers))
+		return -EINVAL;
+#ifdef CONFIG_KEYS
+	if (signal->session_keyring || signal->process_keyring)
+		return -EINVAL;
+#endif
+	return 0;
+}
+
+static int __cr_collect_signal(struct cr_context *ctx, struct signal_struct *signal)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_SIGNAL_STRUCT) {
+		if (obj->o_obj == signal) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(signal);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_SIGNAL_STRUCT]);
+	printk("%s: collect signal_struct %p\n", __func__, signal);
+	return 0;
+}
+
+int cr_collect_signal(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+		struct signal_struct *signal = tsk->signal;
+
+		rv = cr_check_signal(signal);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_signal(ctx, signal);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_SIGNAL_STRUCT) {
+		struct signal_struct *signal = obj->o_obj;
+		unsigned int cnt = atomic_read(&signal->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: signal_struct %p has external references %lu:%u\n", __func__, signal, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
+
+int cr_dump_sighand_struct(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct sighand_struct *sighand = obj->o_obj;
+	struct cr_image_sighand_struct *i;
+
+	printk("%s: dump sighand_struct %p\n", __func__, sighand);
+
+	i = cr_prepare_image(CR_OBJ_SIGHAND_STRUCT, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int cr_check_sighand(struct sighand_struct *sighand)
+{
+	if (!sighand)
+		return -EINVAL;
+#ifdef CONFIG_SIGNALFD
+	if (waitqueue_active(&sighand->signalfd_wqh))
+		return -EINVAL;
+#endif
+	return 0;
+}
+
+static int __cr_collect_sighand(struct cr_context *ctx, struct sighand_struct *sighand)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_SIGHAND_STRUCT) {
+		if (obj->o_obj == sighand) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(sighand);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_SIGHAND_STRUCT]);
+	printk("%s: collect sighand_struct %p\n", __func__, sighand);
+	return 0;
+}
+
+int cr_collect_sighand(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+		struct sighand_struct *sighand = tsk->sighand;
+
+		rv = cr_check_sighand(sighand);
+		if (rv < 0)
+			return rv;
+		rv = __cr_collect_sighand(ctx, sighand);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_SIGHAND_STRUCT) {
+		struct sighand_struct *sighand = obj->o_obj;
+		unsigned int cnt = atomic_read(&sighand->count);
+
+		if (obj->o_count != cnt) {
+			printk("%s: sighand_struct %p has external references %lu:%u\n", __func__, sighand, obj->o_count, cnt);
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
diff --git a/kernel/cr/cpt-sys.c b/kernel/cr/cpt-sys.c
new file mode 100644
index 0000000..6d32243
--- /dev/null
+++ b/kernel/cr/cpt-sys.c
@@ -0,0 +1,258 @@
+#include <linux/capability.h>
+#include <linux/cr.h>
+#include <linux/file.h>
+#include <linux/freezer.h>
+#include <linux/fs.h>
+#include <linux/nsproxy.h>
+#include <linux/pid_namespace.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+#include <linux/utsname.h>
+#include "cr.h"
+
+/* Return 1 if 'tsk' is 'parent' itself or its descendant in any generation. */
+static int child_of(struct task_struct *parent, struct task_struct *tsk)
+{
+	struct task_struct *tmp = tsk;
+
+	while (tmp != &init_task) {
+		if (tmp == parent)
+			return 1;
+		tmp = tmp->real_parent;
+	}
+	/* In case 'parent' is 'init_task'. */
+	return tmp == parent;
+}
+
+static int cr_freeze_tasks(struct task_struct *init_tsk)
+{
+	struct task_struct *tmp, *tsk;
+
+	read_lock(&tasklist_lock);
+	do_each_thread(tmp, tsk) {
+		if (child_of(init_tsk, tsk)) {
+			if (!freeze_task(tsk, 1)) {
+				printk("%s: freezing '%s' failed\n", __func__, tsk->comm);
+				read_unlock(&tasklist_lock);
+				return -EBUSY;
+			}
+		}
+	} while_each_thread(tmp, tsk);
+	read_unlock(&tasklist_lock);
+	return 0;
+}
+
+static void cr_thaw_tasks(struct task_struct *init_tsk)
+{
+	struct task_struct *tmp, *tsk;
+
+	read_lock(&tasklist_lock);
+	do_each_thread(tmp, tsk) {
+		if (child_of(init_tsk, tsk))
+			thaw_process(tsk);
+	} while_each_thread(tmp, tsk);
+	read_unlock(&tasklist_lock);
+}
+
+static void cr_dump_header(struct cr_context *ctx)
+{
+	struct cr_header hdr;
+
+	memset(&hdr, 0, sizeof(struct cr_header));
+	hdr.cr_signature[0] = 'L';
+	hdr.cr_signature[1] = 'i';
+	hdr.cr_signature[2] = 'n';
+	hdr.cr_signature[3] = 'u';
+	hdr.cr_signature[4] = 'x';
+	hdr.cr_signature[5] = 'C';
+	hdr.cr_signature[6] = '/';
+	hdr.cr_signature[7] = 'R';
+	hdr.cr_image_version = cpu_to_le64(CR_IMAGE_VERSION);
+	strncpy((char *)&hdr.cr_uts_release, (const char *)init_uts_ns.name.release, 64);
+#ifdef CONFIG_X86_32
+	hdr.cr_arch = cpu_to_le32(CR_ARCH_X86_32);
+#endif
+	cr_write(ctx, &hdr, sizeof(struct cr_header));
+	cr_align(ctx);
+}
+
+static int cr_dump(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	cr_dump_header(ctx);
+
+	for_each_cr_object(ctx, obj, CR_CTX_PID) {
+		rv = cr_dump_pid(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+
+	for_each_cr_object(ctx, obj, CR_CTX_CRED) {
+		rv = cr_dump_cred(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_FILE) {
+		rv = cr_dump_file(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_FILES_STRUCT) {
+		rv = cr_dump_files_struct(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_FS_STRUCT) {
+		rv = cr_dump_fs_struct(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_UTS_NS) {
+		rv = cr_dump_uts_ns(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+#ifdef CONFIG_SYSVIPC
+	for_each_cr_object(ctx, obj, CR_CTX_IPC_NS) {
+		rv = cr_dump_ipc_ns(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+#endif
+	for_each_cr_object(ctx, obj, CR_CTX_MNT_NS) {
+		rv = cr_dump_mnt_ns(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_PID_NS) {
+		rv = cr_dump_pid_ns(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+#ifdef CONFIG_NET
+	for_each_cr_object(ctx, obj, CR_CTX_NET_NS) {
+		rv = cr_dump_net_ns(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+#endif
+	for_each_cr_object(ctx, obj, CR_CTX_SIGHAND_STRUCT) {
+		rv = cr_dump_sighand_struct(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_SIGNAL_STRUCT) {
+		rv = cr_dump_signal_struct(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_MM_STRUCT) {
+		rv = cr_dump_mm_struct(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	/* After all namespaces. */
+	for_each_cr_object(ctx, obj, CR_CTX_NSPROXY) {
+		rv = cr_dump_nsproxy(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	/* After nsproxies. */
+	for_each_cr_object(ctx, obj, CR_CTX_TASK_STRUCT) {
+		rv = cr_dump_task_struct(ctx, obj);
+		if (rv < 0)
+			return rv;
+	}
+	return ctx->cr_write_error;
+}
+
+SYSCALL_DEFINE3(checkpoint, pid_t, pid, int, fd, unsigned long, flags)
+{
+	struct cr_context *ctx;
+	struct file *file;
+	struct task_struct *init_tsk = NULL, *tsk;
+	int rv = 0;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	file = fget(fd);
+	if (!file)
+		return -EBADF;
+	rv = !(file->f_mode & FMODE_WRITE) ? -EBADF :
+	     (!file->f_op || !file->f_op->write) ? -EINVAL : 0;
+	if (rv < 0)
+		goto out_no_init_tsk;
+
+	rcu_read_lock();
+	tsk = find_task_by_vpid(pid);
+	if (tsk) {
+		init_tsk = task_nsproxy(tsk)->pid_ns->child_reaper;
+		get_task_struct(init_tsk);
+	}
+	rcu_read_unlock();
+	if (!init_tsk) {
+		rv = -ESRCH;
+		goto out_no_init_tsk;
+	}
+
+	ctx = cr_context_create(init_tsk, file);
+	if (!ctx) {
+		rv = -ENOMEM;
+		goto out_ctx_alloc;
+	}
+
+	rv = cr_freeze_tasks(init_tsk);
+	if (rv < 0)
+		goto out_freeze;
+	rv = cr_collect_tasks(ctx, init_tsk);
+	if (rv < 0)
+		goto out_collect_tasks;
+	rv = cr_collect_nsproxy(ctx);
+	if (rv < 0)
+		goto out_collect_nsproxy;
+	rv = cr_collect_mm(ctx);
+	if (rv < 0)
+		goto out_collect_mm;
+	rv = cr_collect_files_struct(ctx);
+	if (rv < 0)
+		goto out_collect_files_struct;
+	rv = cr_collect_fs_struct(ctx);
+	if (rv < 0)
+		goto out_collect_fs_struct;
+	/* After tasks and after files. */
+	rv = cr_collect_cred(ctx);
+	if (rv < 0)
+		goto out_collect_cred;
+	rv = cr_collect_signal(ctx);
+	if (rv < 0)
+		goto out_collect_signal;
+	rv = cr_collect_sighand(ctx);
+	if (rv < 0)
+		goto out_collect_sighand;
+	rv = cr_collect_pid(ctx);
+	if (rv < 0)
+		goto out_collect_pid;
+
+	rv = cr_dump(ctx);
+
+out_collect_pid:
+out_collect_sighand:
+out_collect_signal:
+out_collect_cred:
+out_collect_fs_struct:
+out_collect_files_struct:
+out_collect_mm:
+out_collect_nsproxy:
+out_collect_tasks:
+out_freeze:
+	cr_thaw_tasks(init_tsk);
+	cr_context_destroy(ctx);
+out_ctx_alloc:
+	put_task_struct(init_tsk);
+out_no_init_tsk:
+	fput(file);
+	return rv;
+}
diff --git a/kernel/cr/cpt-task.c b/kernel/cr/cpt-task.c
new file mode 100644
index 0000000..f4274ba
--- /dev/null
+++ b/kernel/cr/cpt-task.c
@@ -0,0 +1,176 @@
+#include <linux/cr.h>
+#include <linux/fs.h>
+#include <linux/pid.h>
+#include <linux/sched.h>
+#include "cr.h"
+
+int cr_dump_pid(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct pid *pid = obj->o_obj;
+	struct cr_image_pid *i;
+	size_t image_len;
+	unsigned int level;
+
+	printk("%s: dump pid %p\n", __func__, pid);
+
+	/* FIXME pid numbers for levels below level of init_tsk are irrelevant */
+	image_len = sizeof(*i) + pid->level * sizeof(__u32);
+
+	i = cr_prepare_image(CR_OBJ_PID, image_len);
+	if (!i)
+		return -ENOMEM;
+
+	for (level = 0; level <= pid->level; level++)
+		i->cr_nr[level] = pid->numbers[level].nr;
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, image_len);
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int __cr_collect_pid(struct cr_context *ctx, struct pid *pid)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_PID) {
+		if (obj->o_obj == pid) {
+			obj->o_count++;
+			return 0;
+		}
+	}
+
+	obj = cr_object_create(pid);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_PID]);
+	printk("%s: collect pid %p\n", __func__, pid);
+	return 0;
+}
+
+int cr_collect_pid(struct cr_context *ctx)
+{
+	struct cr_object *obj;
+	int rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj;
+		int i;
+
+		printk("%s: tsk = %p/%s, ->group_leader = %p/%s\n", __func__, tsk, tsk->comm, tsk->group_leader, tsk->group_leader->comm);
+		for (i = 0; i < PIDTYPE_MAX; i++) {
+			struct pid *pid = tsk->pids[i].pid;
+
+			rv = __cr_collect_pid(ctx, pid);
+			if (rv < 0)
+				return rv;
+		}
+	}
+	for_each_cr_object(ctx, obj, CR_CTX_FILE) {
+		struct file *file = obj->o_obj;
+		struct pid *pid = file->f_owner.pid;
+
+		if (pid) {
+			rv = __cr_collect_pid(ctx, pid);
+			if (rv < 0)
+				return rv;
+		}
+	}
+	/* FIXME pid refcount check should account references from proc inodes */
+	return 0;
+}
+
+int cr_dump_task_struct(struct cr_context *ctx, struct cr_object *obj)
+{
+	struct task_struct *tsk = obj->o_obj;
+	struct cr_object *tmp;
+	struct cr_image_task_struct *i;
+	int n;
+
+	printk("%s: dump task_struct %p\n", __func__, tsk);
+
+	i = cr_prepare_image(CR_OBJ_TASK_STRUCT, sizeof(*i));
+	if (!i)
+		return -ENOMEM;
+
+	tmp = cr_find_obj_by_ptr(ctx, tsk->mm, CR_CTX_MM_STRUCT);
+	i->cr_pos_mm_struct = tmp->o_pos;
+	BUILD_BUG_ON(PIDTYPE_MAX != 3);
+	for (n = 0; n < PIDTYPE_MAX; n++) {
+		tmp = cr_find_obj_by_ptr(ctx, tsk->pids[n].pid, CR_CTX_PID);
+		i->cr_pos_pids[n] = tmp->o_pos;
+	}
+	tmp = cr_find_obj_by_ptr(ctx, tsk->real_cred, CR_CTX_CRED);
+	i->cr_pos_real_cred = tmp->o_pos;
+	tmp = cr_find_obj_by_ptr(ctx, tsk->cred, CR_CTX_CRED);
+	i->cr_pos_cred = tmp->o_pos;
+	BUILD_BUG_ON(TASK_COMM_LEN != 16);
+	strncpy((char *)i->cr_comm, (const char *)tsk->comm, 16);
+	tmp = cr_find_obj_by_ptr(ctx, tsk->fs, CR_CTX_FS_STRUCT);
+	i->cr_pos_fs = tmp->o_pos;
+	tmp = cr_find_obj_by_ptr(ctx, tsk->files, CR_CTX_FILES_STRUCT);
+	i->cr_pos_files = tmp->o_pos;
+	tmp = cr_find_obj_by_ptr(ctx, tsk->nsproxy, CR_CTX_NSPROXY);
+	i->cr_pos_nsproxy = tmp->o_pos;
+	tmp = cr_find_obj_by_ptr(ctx, tsk->signal, CR_CTX_SIGNAL_STRUCT);
+	i->cr_pos_signal = tmp->o_pos;
+	tmp = cr_find_obj_by_ptr(ctx, tsk->sighand, CR_CTX_SIGHAND_STRUCT);
+	i->cr_pos_sighand = tmp->o_pos;
+
+	obj->o_pos = ctx->cr_dump_file->f_pos;
+	cr_write(ctx, i, sizeof(*i));
+	cr_align(ctx);
+	kfree(i);
+	return 0;
+}
+
+static int __cr_collect_task(struct cr_context *ctx, struct task_struct *tsk)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, CR_CTX_TASK_STRUCT) {
+		/* task_struct is never shared. */
+		BUG_ON(obj->o_obj == tsk);
+	}
+
+	obj = cr_object_create(tsk);
+	if (!obj)
+		return -ENOMEM;
+	list_add_tail(&obj->o_list, &ctx->cr_obj[CR_CTX_TASK_STRUCT]);
+	printk("%s: collect task %p/%s\n", __func__, tsk, tsk->comm);
+	return 0;
+}
+
+int cr_collect_tasks(struct cr_context *ctx, struct task_struct *init_tsk)
+{
+	struct cr_object *obj;
+	int rv;
+
+	rv = __cr_collect_task(ctx, init_tsk);
+	if (rv < 0)
+		return rv;
+
+	for_each_cr_object(ctx, obj, CR_CTX_TASK_STRUCT) {
+		struct task_struct *tsk = obj->o_obj, *child;
+
+		/* Collect threads. */
+		if (thread_group_leader(tsk)) {
+			struct task_struct *thread = tsk;
+
+			while ((thread = next_thread(thread)) != tsk) {
+				rv = __cr_collect_task(ctx, thread);
+				if (rv < 0)
+					return rv;
+			}
+		}
+
+		/* Collect children. */
+		list_for_each_entry(child, &tsk->children, sibling) {
+			rv = __cr_collect_task(ctx, child);
+			if (rv < 0)
+				return rv;
+		}
+	}
+	return 0;
+}
diff --git a/kernel/cr/cr-ctx.c b/kernel/cr/cr-ctx.c
new file mode 100644
index 0000000..2385d20
--- /dev/null
+++ b/kernel/cr/cr-ctx.c
@@ -0,0 +1,102 @@
+#include <linux/cr.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/nsproxy.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <asm/processor.h>
+#include <asm/uaccess.h>
+#include "cr.h"
+
+void *cr_prepare_image(unsigned int type, size_t len)
+{
+	void *p;
+
+	p = kzalloc(len, GFP_KERNEL);
+	if (p) {
+		/* Any image must start with header. */
+		struct cr_object_header *cr_hdr = p;
+
+		cr_hdr->cr_type = type;
+		cr_hdr->cr_len = len;
+	}
+	return p;
+}
+
+void cr_write(struct cr_context *ctx, const void *buf, size_t count)
+{
+	struct file *file = ctx->cr_dump_file;
+	mm_segment_t old_fs;
+	ssize_t rv;
+
+	if (ctx->cr_write_error)
+		return;
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+	rv = file->f_op->write(file, (const char __user *)buf, count, &file->f_pos);
+	set_fs(old_fs);
+	if (rv != count)
+		ctx->cr_write_error = (rv < 0) ? rv : -EIO;
+}
+
+void cr_align(struct cr_context *ctx)
+{
+	struct file *file = ctx->cr_dump_file;
+
+	file->f_pos = ALIGN(file->f_pos, 8);
+}
+
+struct cr_object *cr_object_create(void *data)
+{
+	struct cr_object *obj;
+
+	obj = kmalloc(sizeof(struct cr_object), GFP_KERNEL);
+	if (obj) {
+		obj->o_count = 1;
+		obj->o_obj = data;
+	}
+	return obj;
+}
+
+struct cr_context *cr_context_create(struct task_struct *tsk, struct file *file)
+{
+	struct cr_context *ctx;
+
+	ctx = kmalloc(sizeof(struct cr_context), GFP_KERNEL);
+	if (ctx) {
+		int i;
+
+		ctx->cr_init_tsk = tsk;
+		ctx->cr_dump_file = file;
+		ctx->cr_write_error = 0;
+		for (i = 0; i < NR_CR_CTX_TYPES; i++)
+			INIT_LIST_HEAD(&ctx->cr_obj[i]);
+	}
+	return ctx;
+}
+
+void cr_context_destroy(struct cr_context *ctx)
+{
+	struct cr_object *obj, *tmp;
+	int i;
+
+	for (i = 0; i < NR_CR_CTX_TYPES; i++) {
+		for_each_cr_object_safe(ctx, obj, tmp, i) {
+			list_del(&obj->o_list);
+			cr_object_destroy(obj);
+		}
+	}
+	kfree(ctx);
+}
+
+struct cr_object *cr_find_obj_by_ptr(struct cr_context *ctx, const void *ptr, enum cr_context_obj_type type)
+{
+	struct cr_object *obj;
+
+	for_each_cr_object(ctx, obj, type) {
+		if (obj->o_obj == ptr)
+			return obj;
+	}
+	BUG();
+}
diff --git a/kernel/cr/cr.h b/kernel/cr/cr.h
new file mode 100644
index 0000000..526f24e
--- /dev/null
+++ b/kernel/cr/cr.h
@@ -0,0 +1,104 @@
+#ifndef __CR_H
+#define __CR_H
+#include <linux/list.h>
+
+struct ipc_namespace;
+struct mnt_namespace;
+struct net;
+
+struct cr_object {
+	/* entry in one of the ctx->cr_obj[] lists */
+	struct list_head	o_list;
+	/* number of references from collected objects */
+	unsigned long		o_count;
+	/* position in dumpfile, or CR_POS_UNDEF if not yet dumped */
+	loff_t			o_pos;
+	/* pointer to object being collected/dumped */
+	void			*o_obj;
+};
+
+/* Not visible to userspace! */
+enum cr_context_obj_type {
+	CR_CTX_TASK_STRUCT,
+	CR_CTX_NSPROXY,
+	CR_CTX_UTS_NS,
+#ifdef CONFIG_SYSVIPC
+	CR_CTX_IPC_NS,
+#endif
+	CR_CTX_MNT_NS,
+	CR_CTX_PID_NS,
+#ifdef CONFIG_NET
+	CR_CTX_NET_NS,
+#endif
+	CR_CTX_MM_STRUCT,
+	CR_CTX_FS_STRUCT,
+	CR_CTX_FILES_STRUCT,
+	CR_CTX_FILE,
+	CR_CTX_CRED,
+	CR_CTX_SIGNAL_STRUCT,
+	CR_CTX_SIGHAND_STRUCT,
+	CR_CTX_PID,
+
+	NR_CR_CTX_TYPES
+};
+
+struct cr_context {
+	struct task_struct	*cr_init_tsk;
+	struct file		*cr_dump_file;
+	int			cr_write_error;
+	struct list_head	cr_obj[NR_CR_CTX_TYPES];
+};
+
+#define for_each_cr_object(ctx, obj, type)				\
+	list_for_each_entry(obj, &ctx->cr_obj[type], o_list)
+#define for_each_cr_object_safe(ctx, obj, tmp, type)			\
+	list_for_each_entry_safe(obj, tmp, &ctx->cr_obj[type], o_list)
+struct cr_object *cr_find_obj_by_ptr(struct cr_context *ctx, const void *ptr, enum cr_context_obj_type type);
+
+struct cr_object *cr_object_create(void *data);
+static inline void cr_object_destroy(struct cr_object *obj)
+{
+	kfree(obj);
+}
+
+struct cr_context *cr_context_create(struct task_struct *tsk, struct file *file);
+void cr_context_destroy(struct cr_context *ctx);
+
+int cr_collect_tasks(struct cr_context *ctx, struct task_struct *init_tsk);
+
+int cr_collect_nsproxy(struct cr_context *ctx);
+int cr_collect_cred(struct cr_context *ctx);
+int cr_collect_pid(struct cr_context *ctx);
+int cr_collect_signal(struct cr_context *ctx);
+int cr_collect_sighand(struct cr_context *ctx);
+int cr_collect_mm(struct cr_context *ctx);
+int cr_collect_signal_struct(struct cr_context *ctx);
+int __cr_collect_file(struct cr_context *ctx, struct file *file);
+int cr_collect_files_struct(struct cr_context *ctx);
+int cr_collect_fs_struct(struct cr_context *ctx);
+
+void cr_write(struct cr_context *ctx, const void *buf, size_t count);
+void cr_align(struct cr_context *ctx);
+
+void *cr_prepare_image(unsigned int type, size_t len);
+
+int cr_dump_task_struct(struct cr_context *ctx, struct cr_object *obj);
+int cr_dump_nsproxy(struct cr_context *ctx, struct cr_object *obj);
+int cr_dump_uts_ns(struct cr_context *ctx, struct cr_object *obj);
+#ifdef CONFIG_SYSVIPC
+int cr_dump_ipc_ns(struct cr_context *ctx, struct cr_object *obj);
+#endif
+int cr_dump_mnt_ns(struct cr_context *ctx, struct cr_object *obj);
+int cr_dump_pid_ns(struct cr_context *ctx, struct cr_object *obj);
+#ifdef CONFIG_NET
+int cr_dump_net_ns(struct cr_context *ctx, struct cr_object *obj);
+#endif
+int cr_dump_mm_struct(struct cr_context *ctx, struct cr_object *obj);
+int cr_dump_signal_struct(struct cr_context *ctx, struct cr_object *obj);
+int cr_dump_sighand_struct(struct cr_context *ctx, struct cr_object *obj);
+int cr_dump_fs_struct(struct cr_context *ctx, struct cr_object *obj);
+int cr_dump_files_struct(struct cr_context *ctx, struct cr_object *obj);
+int cr_dump_file(struct cr_context *ctx, struct cr_object *obj);
+int cr_dump_cred(struct cr_context *ctx, struct cr_object *obj);
+int cr_dump_pid(struct cr_context *ctx, struct cr_object *obj);
+#endif
diff --git a/kernel/cr/rst-sys.c b/kernel/cr/rst-sys.c
new file mode 100644
index 0000000..35c3d15
--- /dev/null
+++ b/kernel/cr/rst-sys.c
@@ -0,0 +1,9 @@
+#include <linux/capability.h>
+#include <linux/syscalls.h>
+
+SYSCALL_DEFINE2(restart, int, fd, unsigned long, flags)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	return -ENOSYS;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 27dad29..da4fbf6 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -175,3 +175,6 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
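
The struct cr_object bookkeeping declared in kernel/cr/cr.h above can be modelled in userspace. The following is only a minimal sketch, not the patch's code: the names obj_node, find_by_ptr, collect and destroy_all are hypothetical, a singly linked list stands in for list_head, and the "bump the count when the pointer is already tracked" behaviour is an assumption based on the o_count comment, since the collect functions themselves are not shown here. Unlike cr_find_obj_by_ptr(), the lookup returns NULL instead of calling BUG().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for struct cr_object: one node per collected object. */
struct obj_node {
	struct obj_node *next;	/* the patch links nodes with list_head */
	unsigned long count;	/* references seen from collected objects */
	const void *ptr;	/* the object being tracked */
};

/* Look up a tracked pointer; NULL when absent (the patch calls BUG()). */
static struct obj_node *find_by_ptr(struct obj_node *head, const void *ptr)
{
	for (struct obj_node *n = head; n; n = n->next)
		if (n->ptr == ptr)
			return n;
	return NULL;
}

/*
 * Collect a pointer: bump the count if it is already tracked (assumed
 * behaviour), otherwise add a new node with count 1.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int collect(struct obj_node **head, const void *ptr)
{
	struct obj_node *n;

	assert(head != NULL);
	n = find_by_ptr(*head, ptr);
	if (n) {
		n->count++;
		return 0;
	}
	n = malloc(sizeof(*n));
	if (!n)
		return -1;
	n->count = 1;
	n->ptr = ptr;
	n->next = *head;
	*head = n;
	return 0;
}

/* Free every node, mirroring the loop in cr_context_destroy(). */
static void destroy_all(struct obj_node **head)
{
	while (*head) {
		struct obj_node *n = *head;

		*head = n->next;
		free(n);
	}
}
```

Collecting the same pointer twice leaves a single node whose count is 2, which is the kind of number the thread later compares against the object's real refcount.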
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                             ` <20090301013304.GA2428-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
@ 2009-03-01 20:02                               ` Serge E. Hallyn
       [not found]                                 ` <20090301200231.GA25276-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 121+ messages in thread
From: Serge E. Hallyn @ 2009-03-01 20:02 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Ingo Molnar, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
Quoting Alexey Dobriyan (adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org):
> On Fri, Feb 27, 2009 at 01:31:12AM +0300, Alexey Dobriyan wrote:
> > This is collecting and start of dumping part of cleaned up OpenVZ C/R
> > implementation, FYI.
> 
> OK, here is second version which shows what to do with shared objects
> (cr_dump_nsproxy(), cr_dump_task_struct()), introduced more checks
> (still no unlinked files) and dumps some more information including
> structures connections (cr_pos_*)
> 
> Dumping pids is still being thought through, because in OpenVZ pids are
> saved as plain numbers, since CLONE_NEWPID is not allowed in a container.
> In the presence of multiple CLONE_NEWPID levels this must present a big
> problem. It looks like there is no way to avoid dumping pids as separate
> objects.
> 
> As a result, struct cr_image_pid is variable-sized; I don't know how this
> will play out later.
> 
> Also, pid refcount check for external pointers is busted right now,
> because /proc inode pins struct pid, so there is almost always refcount
> vs ->o_count mismatch.
> 
> No restore yet. ;-)
Hi Alexey,
thanks for posting this.  Of course there are some predictable responses
(I like the simplicity of pure in-kernel, Dave will not :) but this
needs to be posted to make us talk about it.
A few more comments that came to me while looking it over:
1. cap_sys_admin check is unfortunate.  In discussions about Oren's
patchset we've agreed that not having that check from the outset forces
us to consider security with each new patch and feature, which is a good
thing.
2. if any tasks being checkpointed are frozen, checkpoint has the
side effect of thawing them, right?
3. wrt pids, i guess what you really want is to store the pids from
init_tsk's level down to the task's lowest pid, right?  Then you
manually set each of those on restart?  Any higher pids of course
don't matter.
4. do you have any thoughts on what to do with the mntns info at
restart?  Will you try to detect mounts which need to be re-created?
How?
5. Since you're always setting f_pos, this won't work straight over
a pipe?  Do you figure that's just not a worthwhile feature?
Were you saying (in response to Dave) that you're having private
discussions about whether to pursue posting this as an alternative
to Oren's patchset?  If so, any updates on those discussions?
thanks,
-serge
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                                 ` <20090301200231.GA25276-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-03-01 20:56                                   ` Alexey Dobriyan
  2009-03-01 22:21                                     ` Serge E. Hallyn
  2009-03-03 16:17                                     ` Cedric Le Goater
  0 siblings, 2 replies; 121+ messages in thread
From: Alexey Dobriyan @ 2009-03-01 20:56 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Ingo Molnar, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
On Sun, Mar 01, 2009 at 02:02:31PM -0600, Serge E. Hallyn wrote:
> Quoting Alexey Dobriyan (adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org):
> > On Fri, Feb 27, 2009 at 01:31:12AM +0300, Alexey Dobriyan wrote:
> > > This is collecting and start of dumping part of cleaned up OpenVZ C/R
> > > implementation, FYI.
> > 
> > OK, here is second version which shows what to do with shared objects
> > (cr_dump_nsproxy(), cr_dump_task_struct()), introduced more checks
> > (still no unlinked files) and dumps some more information including
> > structures connections (cr_pos_*)
> > 
> > Dumping pids is still being thought through, because in OpenVZ pids are
> > saved as plain numbers, since CLONE_NEWPID is not allowed in a container.
> > In the presence of multiple CLONE_NEWPID levels this must present a big
> > problem. It looks like there is no way to avoid dumping pids as separate
> > objects.
> > 
> > As a result, struct cr_image_pid is variable-sized; I don't know how this
> > will play out later.
> > 
> > Also, pid refcount check for external pointers is busted right now,
> > because /proc inode pins struct pid, so there is almost always refcount
> > vs ->o_count mismatch.
> > 
> > No restore yet. ;-)
> 
> Hi Alexey,
> 
> thanks for posting this.  Of course there are some predictable responses
> (I like the simplicity of pure in-kernel, Dave will not :) but this
> needs to be posted to make us talk about it.
> 
> A few more comments that came to me while looking it over:
> 
> 1. cap_sys_admin check is unfortunate.  In discussions about Oren's
> patchset we've agreed that not having that check from the outset forces
> us to consider security with each new patch and feature, which is a good
> thing.
Removing CAP_SYS_ADMIN on restore?
> 2. if any tasks being checkpointed are frozen, checkpoint has the
> side effect of thawing them, right?
Haven't tried, but should be a bug, yes. It will be "thaw or kill"
depending on "flags".
> 3. wrt pids, i guess what you really want is to store the pids from
> init_tsk's level down to the task's lowest pid, right?  Then you
> manually set each of those on restart?  Any higher pids of course
> don't matter.
Yes, numbers are really meant to be from init_tsk level.
> 4. do you have any thoughts on what to do with the mntns info at
> restart?  Will you try to detect mounts which need to be re-created?
> How?
Haven't thought, but it will be tricky for sure :^)
> 5. Since you're always setting f_pos, this won't work straight over
> a pipe?  Do you figure that's just not a worthwhile feature?
So far there were no loops when dumping data structures, but I _think_
there will be some, so seeking over dumpfile would be inevitable.
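
The pipe question can be checked from userspace. This is a small sketch rather than anything from the patch: fd_is_seekable and align8 are hypothetical helper names. The first relies on lseek() failing with ESPIPE on pipes, FIFOs and sockets; the second mirrors the rounding that ALIGN(file->f_pos, 8) performs in cr_align().

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if the descriptor accepts repositioning, 0 if lseek() reports
 * ESPIPE (pipe, FIFO or socket), and -1 on any other error.
 */
static int fd_is_seekable(int fd)
{
	if (lseek(fd, 0, SEEK_CUR) != (off_t)-1)
		return 1;
	return errno == ESPIPE ? 0 : -1;
}

/* Round pos up to the next multiple of 8, as ALIGN(file->f_pos, 8) does. */
static off_t align8(off_t pos)
{
	assert(pos >= 0);
	return (pos + 7) & ~(off_t)7;
}
```

A dumper that only ever moves forward could replace the seek in cr_align() by writing align8(pos) - pos padding bytes, which would also work on a pipe. Seeking backwards to patch earlier records, as anticipated above, has no such workaround.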
> Were you saying (in response to Dave) that you're having private
> discussions about whether to pursue posting this as an alternative
> to Oren's patchset?  If so, any updates on those discussions?
Right now, no.
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-01 20:56                                   ` Alexey Dobriyan
@ 2009-03-01 22:21                                     ` Serge E. Hallyn
  2009-03-03 16:17                                     ` Cedric Le Goater
  1 sibling, 0 replies; 121+ messages in thread
From: Serge E. Hallyn @ 2009-03-01 22:21 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Ingo Molnar, linux-api, containers, hpa, linux-kernel,
	Dave Hansen, linux-mm, viro, mpm, Andrew Morton, torvalds, tglx,
	xemul
Quoting Alexey Dobriyan (adobriyan@gmail.com):
> On Sun, Mar 01, 2009 at 02:02:31PM -0600, Serge E. Hallyn wrote:
> > Quoting Alexey Dobriyan (adobriyan@gmail.com):
> > > On Fri, Feb 27, 2009 at 01:31:12AM +0300, Alexey Dobriyan wrote:
> > > > This is collecting and start of dumping part of cleaned up OpenVZ C/R
> > > > implementation, FYI.
> > > 
> > > OK, here is second version which shows what to do with shared objects
> > > (cr_dump_nsproxy(), cr_dump_task_struct()), introduced more checks
> > > (still no unlinked files) and dumps some more information including
> > > structures connections (cr_pos_*)
> > > 
> > > Dumping pids is still being thought through, because in OpenVZ pids are
> > > saved as plain numbers, since CLONE_NEWPID is not allowed in a container.
> > > In the presence of multiple CLONE_NEWPID levels this must present a big
> > > problem. It looks like there is no way to avoid dumping pids as separate
> > > objects.
> > > 
> > > As a result, struct cr_image_pid is variable-sized; I don't know how this
> > > will play out later.
> > > 
> > > Also, pid refcount check for external pointers is busted right now,
> > > because /proc inode pins struct pid, so there is almost always refcount
> > > vs ->o_count mismatch.
> > > 
> > > No restore yet. ;-)
> > 
> > Hi Alexey,
> > 
> > thanks for posting this.  Of course there are some predictable responses
> > (I like the simplicity of pure in-kernel, Dave will not :) but this
> > needs to be posted to make us talk about it.
> > 
> > A few more comments that came to me while looking it over:
> > 
> > 1. cap_sys_admin check is unfortunate.  In discussions about Oren's
> > patchset we've agreed that not having that check from the outset forces
> > us to consider security with each new patch and feature, which is a good
> > thing.
> 
> Removing CAP_SYS_ADMIN on restore?
And checkpoint.
-serge
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-01 20:56                                   ` Alexey Dobriyan
  2009-03-01 22:21                                     ` Serge E. Hallyn
@ 2009-03-03 16:17                                     ` Cedric Le Goater
  2009-03-03 18:28                                       ` Serge E. Hallyn
  1 sibling, 1 reply; 121+ messages in thread
From: Cedric Le Goater @ 2009-03-03 16:17 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Serge E. Hallyn, linux-api, containers, mpm, linux-kernel,
	Dave Hansen, linux-mm, tglx, viro, hpa, Ingo Molnar, torvalds,
	Andrew Morton, xemul
>> 1. cap_sys_admin check is unfortunate.  In discussions about Oren's
>> patchset we've agreed that not having that check from the outset forces
>> us to consider security with each new patch and feature, which is a good
>> thing.
> 
> Removing CAP_SYS_ADMIN on restore?
we've kept the capabilities in our patchset but the user tools doing checkpoint
and restart are setcap'ed appropriately to be able to do different things like : 
	
	clone() the namespaces
	mount /dev/mqueue
	interact with net_ns
	etc.
at restart, the tasks are restarted through execve(), so they lose their 
capabilities automatically.
but I think we could drop the CAP_SYS_ADMIN tests for some namespaces,
uts and ipc are good candidates. I guess network should require some 
privilege.  
C.  
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-03 16:17                                     ` Cedric Le Goater
@ 2009-03-03 18:28                                       ` Serge E. Hallyn
  0 siblings, 0 replies; 121+ messages in thread
From: Serge E. Hallyn @ 2009-03-03 18:28 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: Alexey Dobriyan, linux-api, containers, mpm, linux-kernel,
	Dave Hansen, linux-mm, tglx, viro, hpa, Ingo Molnar, torvalds,
	Andrew Morton, xemul
Quoting Cedric Le Goater (legoater@free.fr):
> 
> >> 1. cap_sys_admin check is unfortunate.  In discussions about Oren's
> >> patchset we've agreed that not having that check from the outset forces
> >> us to consider security with each new patch and feature, which is a good
> >> thing.
> > 
> > Removing CAP_SYS_ADMIN on restore?
> 
> we've kept the capabilities in our patchset but the user tools doing checkpoint
> and restart are setcap'ed appropriately to be able to do different things like : 
> 	
> 	clone() the namespaces
> 	mount /dev/mqueue
> 	interact with net_ns
> 	etc.
Right, that stuff done in userspace requires capabilities.
> at restart, the tasks are restarted through execve(), so they lose their 
> capabilities automatically.
> 
> but I think we could drop the CAP_SYS_ADMIN tests for some namespaces,
> uts and ipc are good candidates. I guess network should require some 
> privilege.  
Eric and i have talked about that a lot, and so far are continuing
to punt on it.  There are too many possibilities for subtle exploits
so I'm not suggesting changing those now.
But checkpoint and restart are entirely new.  If at each small step
we accept that an unprivileged user should be able to use it safely,
that will lead to a better design, i.e. doing may_ptrace before
checkpoint, and always doing access checks before re-creating a
resource.
If we *don't* do that, we'll have a big-stick setuid root checkpoint
and restart program which isn't at all trustworthy (bc it hasn't
received due scrutiny at each commit point), but must be trusted by
anyone wanting to use it.
And if we're too afraid to remove CAP_SYS_ADMIN checks from unsharing
one innocuous namespace, will we ever convince ourselves to remove it
from an established feature that can recreate every type of resource on
the system?
-serge
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-02-26 15:57                 ` Alexey Dobriyan
@ 2009-03-10 21:53                   ` Alexey Dobriyan
  2009-03-10 23:28                     ` Serge E. Hallyn
  2009-03-11  8:26                     ` Cedric Le Goater
  0 siblings, 2 replies; 121+ messages in thread
From: Alexey Dobriyan @ 2009-03-10 21:53 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andrew Morton, mpm, containers, hpa, linux-kernel, linux-mm, viro,
	linux-api, mingo, torvalds, tglx, xemul
On Thu, Feb 26, 2009 at 06:57:55PM +0300, Alexey Dobriyan wrote:
> On Thu, Feb 12, 2009 at 03:04:05PM -0800, Dave Hansen wrote:
> > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... kernel/cpt/ | diffstat 
> >  47 files changed, 20702 insertions(+)
> > 
> > One important thing that leaves out is the interaction that this code
> > has with the rest of the kernel.  That's critically important when
> > considering long-term maintenance, and I'd be curious how the OpenVZ
> > folks view it. 
> 
> OpenVZ as-is in some cases wants some functions to be made global
> (and if C/R code will be modular, exported). Or probably several
> iterators added.
> 
> But it's negligible amount of changes compared to main code.
Here is what C/R code wants from pid allocator.
With the introduction of hierarchical PID namespaces, struct pid can
have not one but many numbers -- tuple (pid_0, pid_1, ..., pid_N),
where pid_i is pid number in pid_ns which has level i.
If the root pid_ns of the container has level n, the numbers from level n to N
inclusive should be dumped and restored.
During struct pid creation the numbers for levels 0 to n-1 can be anything,
because they're outside the container's pid_ns, but the rest should be the same.
Code will be ifdeffed and commented, but anyhow, this is an example of
change C/R will require from the rest of the kernel.
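
To make the level arithmetic concrete, here is a userspace sketch of which numbers of the tuple a checkpoint would keep. It is an illustration only: struct pid_tuple, MAX_PID_LEVELS and pid_numbers_to_dump are hypothetical names, and the real struct pid stores its numbers differently.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PID_LEVELS 8	/* arbitrary bound for this sketch */

/* Model of a struct pid: one number per namespace level, 0 .. level. */
struct pid_tuple {
	unsigned int level;		/* N: the deepest namespace level */
	int nr[MAX_PID_LEVELS];		/* nr[i] is the number seen at level i */
};

/*
 * Copy the numbers a checkpoint would keep: levels root_level .. level.
 * Returns how many numbers were written to out, or 0 if the pid does not
 * reach down to the container's root namespace.
 */
static size_t pid_numbers_to_dump(const struct pid_tuple *pid,
				  unsigned int root_level,
				  int *out, size_t out_len)
{
	size_t n = 0;

	assert(pid->level < MAX_PID_LEVELS);
	if (root_level > pid->level)
		return 0;
	for (unsigned int i = root_level; i <= pid->level && n < out_len; i++)
		out[n++] = pid->nr[i];
	return n;
}
```

For a task seen as (4711, 35, 1) at levels 0, 1 and 2, a container rooted at level 1 keeps (35, 1). The host-level 4711 is free to change across restart.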
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -182,6 +182,34 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return -1;
 }
 
+static int set_pidmap(struct pid_namespace *pid_ns, pid_t pid)
+{
+	int offset;
+	struct pidmap *map;
+
+	offset = pid & BITS_PER_PAGE_MASK;
+	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
+	if (unlikely(!map->page)) {
+		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+		/*
+		 * Free the page if someone raced with us
+		 * installing it:
+		 */
+		spin_lock_irq(&pidmap_lock);
+		if (map->page)
+			kfree(page);
+		else
+			map->page = page;
+		spin_unlock_irq(&pidmap_lock);
+		if (unlikely(!map->page))
+			return -ENOMEM;
+	}
+	if (test_and_set_bit(offset, map->page))
+		return -EBUSY;
+	atomic_dec(&map->nr_free);
+	return pid;
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
@@ -239,7 +267,7 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, int *cr_nr, unsigned int cr_level)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -253,7 +281,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		if (cr_nr && ns->level - i <= cr_level)
+			nr = set_pidmap(tmp, cr_nr[ns->level - i]);
+		else
+			nr = alloc_pidmap(tmp);
 		if (nr < 0)
 			goto out_free;
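
The reservation semantics of set_pidmap() above (take exactly this number or fail with -EBUSY) can be tried in userspace. This is a single-threaded sketch under stated assumptions: struct pid_bitmap, SKETCH_PID_MAX, pid_bitmap_init and pid_bitmap_set are hypothetical names, the bitmap is a fixed array instead of lazily allocated pages, the range check is an addition that the patch does not have, and a plain test-then-set replaces the atomic test_and_set_bit().

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PID_MAX 1024
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct pid_bitmap {
	unsigned long words[SKETCH_PID_MAX / (sizeof(unsigned long) * CHAR_BIT)];
	int nr_free;
};

static void pid_bitmap_init(struct pid_bitmap *map)
{
	memset(map->words, 0, sizeof(map->words));
	map->nr_free = SKETCH_PID_MAX;
}

/*
 * Reserve exactly `pid`. Returns pid on success, -EBUSY if the number is
 * already taken, -EINVAL if it is out of range.
 */
static int pid_bitmap_set(struct pid_bitmap *map, int pid)
{
	unsigned long mask;
	size_t word;

	assert(map != NULL);
	if (pid <= 0 || pid >= SKETCH_PID_MAX)
		return -EINVAL;
	word = (size_t)pid / BITS_PER_WORD;
	mask = 1UL << ((size_t)pid % BITS_PER_WORD);
	if (map->words[word] & mask)
		return -EBUSY;
	map->words[word] |= mask;
	map->nr_free--;
	return pid;
}
```

Reserving 42 twice returns 42 and then -EBUSY, which is the case restart has to handle when a wanted number is already in use in the target namespace.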
 
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-10 21:53                   ` Alexey Dobriyan
@ 2009-03-10 23:28                     ` Serge E. Hallyn
  2009-03-11  8:26                     ` Cedric Le Goater
  1 sibling, 0 replies; 121+ messages in thread
From: Serge E. Hallyn @ 2009-03-10 23:28 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Dave Hansen, linux-api, containers, mpm, linux-kernel, linux-mm,
	tglx, viro, hpa, Andrew Morton, torvalds, mingo, xemul
Quoting Alexey Dobriyan (adobriyan@gmail.com):
> On Thu, Feb 26, 2009 at 06:57:55PM +0300, Alexey Dobriyan wrote:
> > On Thu, Feb 12, 2009 at 03:04:05PM -0800, Dave Hansen wrote:
> > > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... kernel/cpt/ | diffstat 
> 
> > >  47 files changed, 20702 insertions(+)
> > > 
> > > One important thing that leaves out is the interaction that this code
> > > has with the rest of the kernel.  That's critically important when
> > > considering long-term maintenance, and I'd be curious how the OpenVZ
> > > folks view it. 
> > 
> > OpenVZ as-is in some cases wants some functions to be made global
> > (and if C/R code will be modular, exported). Or probably several
> > iterators added.
> > 
> > But it's negligible amount of changes compared to main code.
> 
> Here is what C/R code wants from pid allocator.
Yup.  Agreed.  That is exactly what I would have thought it would look
like.  We may have found the first bit of helper code we can all agree
on for c/r?  :)
Eric may disagree as he wanted to play games with
/proc/sys/kernel/pid_max, but that seems hard to pull off for nested
pid namespaces.
thanks,
-serge
> With the introduction of hierarchical PID namespaces, struct pid can
> have not one but many numbers -- tuple (pid_0, pid_1, ..., pid_N),
> where pid_i is pid number in pid_ns which has level i.
> 
> If the root pid_ns of the container has level n, the numbers from level n
> to N inclusive should be dumped and restored.
> 
> During struct pid creation the numbers for levels 0 to n-1 can be anything,
> because they're outside the container's pid_ns, but the rest should be the
> same.
> 
> Code will be ifdeffed and commented, but anyhow, this is an example of
> change C/R will require from the rest of the kernel.
> 
> 
> 
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -182,6 +182,34 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  	return -1;
>  }
> 
> +static int set_pidmap(struct pid_namespace *pid_ns, pid_t pid)
> +{
> +	int offset;
> +	struct pidmap *map;
> +
> +	offset = pid & BITS_PER_PAGE_MASK;
> +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> +	if (unlikely(!map->page)) {
> +		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +		/*
> +		 * Free the page if someone raced with us
> +		 * installing it:
> +		 */
> +		spin_lock_irq(&pidmap_lock);
> +		if (map->page)
> +			kfree(page);
> +		else
> +			map->page = page;
> +		spin_unlock_irq(&pidmap_lock);
> +		if (unlikely(!map->page))
> +			return -ENOMEM;
> +	}
> +	if (test_and_set_bit(offset, map->page))
> +		return -EBUSY;
> +	atomic_dec(&map->nr_free);
> +	return pid;
> +}
> +
>  int next_pidmap(struct pid_namespace *pid_ns, int last)
>  {
>  	int offset;
> @@ -239,7 +267,7 @@ void free_pid(struct pid *pid)
>  	call_rcu(&pid->rcu, delayed_put_pid);
>  }
> 
> -struct pid *alloc_pid(struct pid_namespace *ns)
> +struct pid *alloc_pid(struct pid_namespace *ns, int *cr_nr, unsigned int cr_level)
>  {
>  	struct pid *pid;
>  	enum pid_type type;
> @@ -253,7 +281,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> 
>  	tmp = ns;
>  	for (i = ns->level; i >= 0; i--) {
> -		nr = alloc_pidmap(tmp);
> +		if (cr_nr && ns->level - i <= cr_level)
> +			nr = set_pidmap(tmp, cr_nr[ns->level - i]);
> +		else
> +			nr = alloc_pidmap(tmp);
>  		if (nr < 0)
>  			goto out_free;
> 
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-10 21:53                   ` Alexey Dobriyan
  2009-03-10 23:28                     ` Serge E. Hallyn
@ 2009-03-11  8:26                     ` Cedric Le Goater
  2009-03-12 14:53                       ` Serge E. Hallyn
  1 sibling, 1 reply; 121+ messages in thread
From: Cedric Le Goater @ 2009-03-11  8:26 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Dave Hansen, linux-api, containers, mpm, linux-kernel, linux-mm,
	tglx, viro, hpa, Andrew Morton, torvalds, mingo, xemul
Alexey Dobriyan wrote:
> On Thu, Feb 26, 2009 at 06:57:55PM +0300, Alexey Dobriyan wrote:
>> On Thu, Feb 12, 2009 at 03:04:05PM -0800, Dave Hansen wrote:
>>> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... kernel/cpt/ | diffstat 
> 
>>>  47 files changed, 20702 insertions(+)
>>>
>>> One important thing that leaves out is the interaction that this code
>>> has with the rest of the kernel.  That's critically important when
>>> considering long-term maintenance, and I'd be curious how the OpenVZ
>>> folks view it. 
>> OpenVZ as-is in some cases wants some functions to be made global
>> (and if C/R code will be modular, exported). Or probably several
>> iterators added.
>>
>> But it's negligible amount of changes compared to main code.
> 
> Here is what C/R code wants from pid allocator.
> 
> With the introduction of hierarchical PID namespaces, struct pid can
> have not one but many numbers -- tuple (pid_0, pid_1, ..., pid_N),
> where pid_i is pid number in pid_ns which has level i.
> 
> If the root pid_ns of the container has level n, the numbers from level n
> to N inclusive should be dumped and restored.
> 
> During struct pid creation the numbers for levels 0 to n-1 can be anything,
> because they're outside the container's pid_ns, but the rest should be the
> same.
> 
> Code will be ifdeffed and commented, but anyhow, this is an example of
> change C/R will require from the rest of the kernel.
> 
> 
> 
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -182,6 +182,34 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  	return -1;
>  }
>  
> +static int set_pidmap(struct pid_namespace *pid_ns, pid_t pid)
> +{
> +	int offset;
> +	struct pidmap *map;
> +
> +	offset = pid & BITS_PER_PAGE_MASK;
> +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> +	if (unlikely(!map->page)) {
> +		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +		/*
> +		 * Free the page if someone raced with us
> +		 * installing it:
> +		 */
> +		spin_lock_irq(&pidmap_lock);
> +		if (map->page)
> +			kfree(page);
> +		else
> +			map->page = page;
> +		spin_unlock_irq(&pidmap_lock);
> +		if (unlikely(!map->page))
> +			return -ENOMEM;
> +	}
> +	if (test_and_set_bit(offset, map->page))
> +		return -EBUSY;
> +	atomic_dec(&map->nr_free);
> +	return pid;
> +}
> +
>  int next_pidmap(struct pid_namespace *pid_ns, int last)
>  {
>  	int offset;
> @@ -239,7 +267,7 @@ void free_pid(struct pid *pid)
>  	call_rcu(&pid->rcu, delayed_put_pid);
>  }
>  
> -struct pid *alloc_pid(struct pid_namespace *ns)
> +struct pid *alloc_pid(struct pid_namespace *ns, int *cr_nr, unsigned int cr_level)
>  {
>  	struct pid *pid;
>  	enum pid_type type;
> @@ -253,7 +281,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  
>  	tmp = ns;
>  	for (i = ns->level; i >= 0; i--) {
> -		nr = alloc_pidmap(tmp);
> +		if (cr_nr && ns->level - i <= cr_level)
> +			nr = set_pidmap(tmp, cr_nr[ns->level - i]);
> +		else
> +			nr = alloc_pidmap(tmp);
>  		if (nr < 0)
>  			goto out_free;
This patch assumes that the process is restored into a state which took several 
clone(CLONE_NEWPID) calls to reach. If you replay these clone() calls, which is what
restart is in the end (an optimized replay), you would only need something like the patch below. 
Index: 2.6.git/kernel/pid.c
===================================================================
--- 2.6.git.orig/kernel/pid.c
+++ 2.6.git/kernel/pid.c
@@ -122,12 +122,12 @@ static void free_pidmap(struct upid *upi
 	atomic_inc(&map->nr_free);
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t upid)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	struct pidmap *map;
 
-	pid = last + 1;
+	pid = upid ? upid : last + 1;
 	if (pid >= pid_max)
 		pid = RESERVED_PIDS;
 	offset = pid & BITS_PER_PAGE_MASK;
@@ -239,7 +239,7 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t next_pid)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -253,10 +253,15 @@ struct pid *alloc_pid(struct pid_namespa
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		nr = alloc_pidmap(tmp, next_pid);
 		if (nr < 0)
 			goto out_free;
 
+		/* The next_pid is only applicable for the ns namespace, not
+		 * its parents.
+		 */
+		next_pid = 0;
+
 		pid->numbers[i].nr = nr;
 		pid->numbers[i].ns = tmp;
 		tmp = tmp->parent;
Well, that's how we do it, but I'm not against your patch. It fits our needs too. 
It's just a bit intrusive for the pid bitmap. If we mix both patches, we get something
like this fake patch, which is a bit less intrusive IMO. Not tested though.
 
@@ -122,12 +122,12 @@ static void free_pidmap(struct upid *upi
 	atomic_inc(&map->nr_free);
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t upid)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	struct pidmap *map;
 
-	pid = last + 1;
+	pid = upid ? upid : last + 1;
 	if (pid >= pid_max)
 		pid = RESERVED_PIDS;
 	offset = pid & BITS_PER_PAGE_MASK;
@@ -239,7 +267,7 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, int *cr_nr, unsigned int cr_level)
 {
 	struct pid *pid;
 	enum pid_type type;
@@ -253,7 +281,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		if (cr_nr && ns->level - i <= cr_level) {
+			nr = alloc_pidmap(tmp, cr_nr[ns->level - i]);
+			/* got a different pid than requested: treat as busy */
+			if (nr >= 0 && nr != cr_nr[ns->level - i])
+				nr = -EBUSY;
+		} else {
+			nr = alloc_pidmap(tmp, 0);
+		}
 		if (nr < 0)
 			goto out_free;
 
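The behaviour being discussed for alloc_pidmap() can be modelled in userspace. The sketch below is not kernel code: toy_pidmap, TOY_PID_MAX, TOY_RESERVED_PIDS and toy_alloc_pid() are hypothetical names, the bitmap is a plain byte array, and there is no locking. It only illustrates "take the requested pid or fail with -EBUSY, otherwise scan from last + 1". Leaving last_pid untouched for a requested pid is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define TOY_PID_MAX       64   /* assumption: tiny pid space for the demo */
#define TOY_RESERVED_PIDS 4    /* assumption: wrap-around restarts here */

struct toy_pidmap {
	unsigned char used[TOY_PID_MAX]; /* 1 = pid in use */
	int last_pid;                    /* last pid handed out by a scan */
};

static void toy_pidmap_init(struct toy_pidmap *map)
{
	memset(map, 0, sizeof(*map));
	map->used[0] = 1; /* pid 0 is never handed out */
}

/*
 * want == 0: allocate the next free pid after last_pid (wrapping).
 * want  > 0: allocate exactly that pid, or fail with -EBUSY.
 * Returns the pid, or a negative errno.
 */
static int toy_alloc_pid(struct toy_pidmap *map, int want)
{
	int pid, scanned;

	if (want < 0 || want >= TOY_PID_MAX)
		return -EINVAL;

	if (want) {
		if (map->used[want])
			return -EBUSY;
		map->used[want] = 1;
		return want; /* last_pid is deliberately left alone */
	}

	pid = map->last_pid + 1;
	for (scanned = 0; scanned < TOY_PID_MAX; scanned++, pid++) {
		if (pid >= TOY_PID_MAX)
			pid = TOY_RESERVED_PIDS;
		if (!map->used[pid]) {
			map->used[pid] = 1;
			map->last_pid = pid;
			return pid;
		}
	}
	return -EAGAIN; /* pid space exhausted */
}
```

A restart path would call toy_alloc_pid(&map, saved_pid) and give up on -EBUSY; every other caller passes 0 and keeps the usual sequential behaviour.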
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-11  8:26                     ` Cedric Le Goater
@ 2009-03-12 14:53                       ` Serge E. Hallyn
  2009-03-12 21:01                         ` Greg Kurz
  2009-03-13 15:47                         ` Cedric Le Goater
  0 siblings, 2 replies; 121+ messages in thread
From: Serge E. Hallyn @ 2009-03-12 14:53 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: Alexey Dobriyan, linux-api, containers, hpa, linux-kernel,
	Dave Hansen, linux-mm, viro, mingo, mpm, tglx, torvalds,
	Andrew Morton, xemul
Quoting Cedric Le Goater (legoater@free.fr):
> Alexey Dobriyan wrote:
> > On Thu, Feb 26, 2009 at 06:57:55PM +0300, Alexey Dobriyan wrote:
> >> On Thu, Feb 12, 2009 at 03:04:05PM -0800, Dave Hansen wrote:
> >>> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... kernel/cpt/ | diffstat 
> > 
> >>>  47 files changed, 20702 insertions(+)
> >>>
> >>> One important thing that leaves out is the interaction that this code
> >>> has with the rest of the kernel.  That's critically important when
> >>> considering long-term maintenance, and I'd be curious how the OpenVZ
> >>> folks view it. 
> >> OpenVZ as-is in some cases wants some functions to be made global
> >> (and if C/R code will be modular, exported). Or probably several
> >> iterators added.
> >>
> >> But it's negligible amount of changes compared to main code.
> > 
> > Here is what C/R code wants from pid allocator.
> > 
> > With the introduction of hierarchical PID namespaces, struct pid can
> > have not one but many numbers -- tuple (pid_0, pid_1, ..., pid_N),
> > where pid_i is pid number in pid_ns which has level i.
> > 
> > Now root pid_ns of container has level n -- numbers from level n to N
> > inclusively should be dumped and restored.
> > 
> > During struct pid creation first n-1 numbers can be anything, because they're
> > outside of pid_ns, but the rest should be the same.
> > 
> > Code will be ifdeffed and commented, but anyhow, this is an example of
> > change C/R will require from the rest of the kernel.
> > 
> > 
> > 
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -182,6 +182,34 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> >  	return -1;
> >  }
> >  
> > +static int set_pidmap(struct pid_namespace *pid_ns, pid_t pid)
> > +{
> > +	int offset;
> > +	struct pidmap *map;
> > +
> > +	offset = pid & BITS_PER_PAGE_MASK;
> > +	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
> > +	if (unlikely(!map->page)) {
> > +		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > +		/*
> > +		 * Free the page if someone raced with us
> > +		 * installing it:
> > +		 */
> > +		spin_lock_irq(&pidmap_lock);
> > +		if (map->page)
> > +			kfree(page);
> > +		else
> > +			map->page = page;
> > +		spin_unlock_irq(&pidmap_lock);
> > +		if (unlikely(!map->page))
> > +			return -ENOMEM;
> > +	}
> > +	if (test_and_set_bit(offset, map->page))
> > +		return -EBUSY;
> > +	atomic_dec(&map->nr_free);
> > +	return pid;
> > +}
> > +
> >  int next_pidmap(struct pid_namespace *pid_ns, int last)
> >  {
> >  	int offset;
> > @@ -239,7 +267,7 @@ void free_pid(struct pid *pid)
> >  	call_rcu(&pid->rcu, delayed_put_pid);
> >  }
> >  
> > -struct pid *alloc_pid(struct pid_namespace *ns)
> > +struct pid *alloc_pid(struct pid_namespace *ns, int *cr_nr, unsigned int cr_level)
> >  {
> >  	struct pid *pid;
> >  	enum pid_type type;
> > @@ -253,7 +281,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >  
> >  	tmp = ns;
> >  	for (i = ns->level; i >= 0; i--) {
> > -		nr = alloc_pidmap(tmp);
> > +		if (cr_nr && ns->level - i <= cr_level)
> > +			nr = set_pidmap(tmp, cr_nr[ns->level - i]);
> > +		else
> > +			nr = alloc_pidmap(tmp);
> >  		if (nr < 0)
> >  			goto out_free;
> 
> This patch supposes that the process is restored in a state which took several 
> clone(CLONE_NEWPID) to reach. If you replay these clone() calls, which is what restart
> is in the end (an optimized replay), you would only need something like below. 
No, what you're suggesting does not suffice.
Call (5591,3,1) the task known as 5591 in the init_pid_ns, 3 in a child pid
ns, and 1 in a grandchild pid_ns created from there.  Now assume we are
checkpointing tasks T1=(5592,1), and T2=(5594,3,1).
We don't care about the first number in the tuples, so they will be
random numbers after the recreate.  But we do care about the second
numbers.  But specifying CLONE_NEWPID while recreating the process tree
in userspace does not allow you to specify the 3 in (5594,3,1).
Or are you suggesting that you'll do a dummy clone of (5594,2) so that
the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?
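To make the cost of the "dummy clone" idea concrete: with a purely sequential allocator, the only way to land on a given pid inside a namespace is to burn every number before it. A minimal sketch, assuming a hypothetical helper dummy_clones_needed() and an allocator that only ever hands out last + 1 (wrap-around is not modelled):

```c
#include <assert.h>

/*
 * 'last' is the most recently allocated pid in the namespace (0 if the
 * namespace is still empty). Returns how many throwaway tasks must be
 * created before the next clone() receives 'target', or -1 if the
 * allocator has already moved past 'target'.
 */
static int dummy_clones_needed(int last, int target)
{
	int dummies = 0;

	if (target <= last)
		return -1;

	/* every pid between last and target has to be consumed first */
	while (last + 1 != target) {
		last++;
		dummies++;
	}
	return dummies;
}
```

For T2=(5594,3,1), with T1 already holding pid 1 in the child namespace, dummy_clones_needed(1, 3) is 1: the throwaway task that would take pid 2.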
-serge
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-12 14:53                       ` Serge E. Hallyn
@ 2009-03-12 21:01                         ` Greg Kurz
  2009-03-12 21:21                           ` Serge E. Hallyn
  2009-03-13 15:47                         ` Cedric Le Goater
  1 sibling, 1 reply; 121+ messages in thread
From: Greg Kurz @ 2009-03-12 21:01 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Cedric Le Goater, Andrew Morton, linux-api, containers, mpm,
	linux-kernel, Dave Hansen, linux-mm, tglx, viro, hpa, mingo,
	torvalds, Alexey Dobriyan, xemul
On Thu, 2009-03-12 at 09:53 -0500, Serge E. Hallyn wrote:
> Or are you suggesting that you'll do a dummy clone of (5594,2) so that
> the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?
> 
Of course not, but one should be able to tell clone() to pick a specific
pid.
-- 
Gregory Kurz                                     gkurz@fr.ibm.com
Software Engineer @ IBM/Meiosys                  http://www.ibm.com
Tel +33 (0)534 638 479                           Fax +33 (0)561 400 420
"Anarchy is about taking complete responsibility for yourself."
        Alan Moore.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-12 21:01                         ` Greg Kurz
@ 2009-03-12 21:21                           ` Serge E. Hallyn
  2009-03-13  4:29                             ` Ying Han
  0 siblings, 1 reply; 121+ messages in thread
From: Serge E. Hallyn @ 2009-03-12 21:21 UTC (permalink / raw)
  To: Greg Kurz
  Cc: Cedric Le Goater, Andrew Morton, linux-api, containers, mpm,
	linux-kernel, Dave Hansen, linux-mm, tglx, viro, hpa, mingo,
	torvalds, Alexey Dobriyan, xemul
Quoting Greg Kurz (gkurz@fr.ibm.com):
> On Thu, 2009-03-12 at 09:53 -0500, Serge E. Hallyn wrote:
> > Or are you suggesting that you'll do a dummy clone of (5594,2) so that
> > the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?
> > 
> 
> Of course not
Ok - someone *did* argue that at some point I think...
> but one should be able to tell clone() to pick a specific
> pid.
Can you explain exactly how?  I must be missing something clever.
-serge
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-02-13 23:28       ` Andrew Morton
  2009-02-14 23:08         ` Ingo Molnar
       [not found]         ` <20090213152836.0fbbfa7d.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2009-03-13  2:45         ` Oren Laadan
  2009-03-13  3:57           ` Oren Laadan
  2 siblings, 1 reply; 121+ messages in thread
From: Oren Laadan @ 2009-03-13  2:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Hansen, mingo, linux-api, containers, linux-kernel, linux-mm,
	torvalds, viro, hpa, tglx
Hi,
Just got back from 3 weeks with practically no internet, and I see
that I missed a big party !
Trying to catch up with what's been said so far --
"An app really has to know whether it can reliably checkpoint+restart."
It was suggested (Dave) to have an "uncheckpointable" flag at the container,
process, or resource level. Another suggestion (Serge, Alexey) was to let
the app try to checkpoint and return an error.
For what it's worth, I vote for the latter. Have the checkpoint code always
return an error if the checkpoint cannot be taken. If checkpoint succeeds
then the app/user is guaranteed that restart will succeed (if it is given
the right starting conditions, e.g. correct file system view).
To figure out what went wrong and when, the c/r code can indicate the _reason_
for the failure (e.g. output to the console, or other means) so that the
frustrated user/developer/app can report it. I also think it's cleaner, as
it keeps c/r considerations within the c/r subsystem and not scattered around
different locations in the kernel.
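A rough userspace sketch of what "fail the checkpoint and report the reason" could look like. Everything here is hypothetical: toy_resource, toy_checkpoint(), the resource kinds and the choice of -ENOSYS are illustrative and are not names or behaviour taken from the patch set.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical resource kinds; not taken from the patch set. */
enum toy_kind { TOY_REGULAR_FILE, TOY_PIPE, TOY_SOCKET, TOY_EPOLL };

struct toy_resource {
	enum toy_kind kind;
	int fd;
};

static const char *toy_kind_name(enum toy_kind kind)
{
	switch (kind) {
	case TOY_REGULAR_FILE: return "regular file";
	case TOY_PIPE:         return "pipe";
	case TOY_SOCKET:       return "socket";
	case TOY_EPOLL:        return "epoll";
	}
	return "unknown";
}

/* Assumption for the demo: only regular files and pipes can be saved. */
static int toy_supported(enum toy_kind kind)
{
	return kind == TOY_REGULAR_FILE || kind == TOY_PIPE;
}

/*
 * Returns 0 if every resource can be checkpointed. Otherwise returns
 * -ENOSYS and writes the reason for the first failure into 'why'.
 */
static int toy_checkpoint(const struct toy_resource *res, int n,
			  char *why, size_t len)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!toy_supported(res[i].kind)) {
			snprintf(why, len, "fd %d: %s not supported",
				 res[i].fd, toy_kind_name(res[i].kind));
			return -ENOSYS;
		}
	}
	if (len)
		why[0] = '\0';
	return 0;
}
```

The point of the shape is that the check lives entirely in the c/r code: nothing elsewhere has to carry an "uncheckpointable" flag, and the caller gets both an error code and a string it can put in a bug report.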
Andrew Morton wrote:
> On Thu, 12 Feb 2009 10:11:22 -0800
> Dave Hansen <dave@linux.vnet.ibm.com> wrote:
> 
>> ...
>>
>>> - In bullet-point form, what features are missing, and should be added?
>>  * support for more architectures than i386
>>  * file descriptors:
>>   * sockets (network, AF_UNIX, etc...)
>>   * devices files
>>   * shmfs, hugetlbfs
>>   * epoll
>>   * unlinked files
>>  * Filesystem state
>>   * contents of files
>>   * mount tree for individual processes
>>  * flock
>>  * threads and sessions
>>  * CPU and NUMA affinity
>>  * sys_remap_file_pages()
>>
>> This is a very minimal list that is surely incomplete and sure to grow.
> 
> That's a worry.
> 
>>> For extra marks:
>>>
>>> - Will any of this involve non-trivial serialisation of kernel
>>>   objects?  If so, that's getting into the
>>>   unacceptably-expensive-to-maintain space, I suspect.
>> We have some structures that are certainly tied to the kernel-internal
>> ones.  However, we are certainly *not* simply writing kernel structures
>> to userspace.  We could do that with /dev/mem.  We are carefully pulling
>> out the minimal bits of information from the kernel structures that we
>> *need* to recreate the function of the structure at restart.  There is a
>> maintenance burden here but, so far, that burden is almost entirely in
>> checkpoint/*.c.  We intend to test this functionality thoroughly to
>> ensure that we don't regress once we have integrated it.
> 
> I guess my question can be approximately simplified to: "will it end up
> looking like openvz"?  (I don't believe that we know of any other way
> of implementing this?)
> 
> Because if it does then that's a concern, because my assessment when I
> looked at that code (a number of years ago) was that having code of
> that nature in mainline would be pretty costly to us, and rather
> unwelcome.
I originally implemented c/r for linux as a kernel module, without
requiring any changes to the kernel. (Doing the namespaces as a kernel
module was much harder). For more details, see:
	https://www.ncl.cs.columbia.edu/research/migrate
The current set of patches is the beginning of a re-implementation
based on that work and other lessons learned, as well as feedback and
collaboration with other players.
I am confident that the vast majority of the code will end up as a
separate "subsystem", and that relatively few changes will be required
from the existing kernel.
> The broadest form of the question is "will we end up regretting having
> done this".
I bet that once this works for a critical mass of apps/users, we will
never regret having done this. (We may regret, and fix, having done
specific parts one way or another.)
> 
> If we can arrange for the implementation to sit quietly over in a
> corner with a team of people maintaining it and not screwing up other
> people's work then I guess we'd be OK - if it breaks then the breakage
> is localised.
In my experience, there is very little code of the c/r that affects
other parts of the kernel, it's mostly isolated. So I believe this
will be the case.
> 
> And it's not just a matter of "does the diffstat only affect a single
> subdirectory".  We also should watch out for the imposition of new
> rules which kernel code must follow.  "you can't do that, because we
> can't serialise it", or something.
> 
> Similar to the way in which perfectly correct and normal kernel
> sometimes has to be changed because it unexpectedly upsets the -rt
> patch.
> 
> Do you expect that any restrictions of this type will be imposed?
> 
That's an excellent point. Again, judging from past experience,
it is possible (but not always pretty) to implement c/r as a kernel
module, without requiring _any_ kernel changes. I can't think of
any such restrictions, but we'll certainly have to keep our eyes
open.
Oren
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
  2009-03-13  2:45         ` Oren Laadan
@ 2009-03-13  3:57           ` Oren Laadan
  0 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-03-13  3:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api, containers, linux-kernel, Dave Hansen, linux-mm, viro,
	hpa, mingo, torvalds, tglx
Oren Laadan wrote:
> Hi,
> 
> Just got back from 3 weeks with practically no internet, and I see
> that I missed a big party !
> 
> Trying to catch up with what's been said so far --
[...]
>>>>
>>>> - Will any of this involve non-trivial serialisation of kernel
>>>>   objects?  If so, that's getting into the
>>>>   unacceptably-expensive-to-maintain space, I suspect.
>>> We have some structures that are certainly tied to the kernel-internal
>>> ones.  However, we are certainly *not* simply writing kernel structures
>>> to userspace.  We could do that with /dev/mem.  We are carefully pulling
>>> out the minimal bits of information from the kernel structures that we
>>> *need* to recreate the function of the structure at restart.  There is a
>>> maintenance burden here but, so far, that burden is almost entirely in
>>> checkpoint/*.c.  We intend to test this functionality thoroughly to
>>> ensure that we don't regress once we have integrated it.
>> I guess my question can be approximately simplified to: "will it end up
>> looking like openvz"?  (I don't believe that we know of any other way
>> of implementing this?)
>>
>> Because if it does then that's a concern, because my assessment when I
>> looked at that code (a number of years ago) was that having code of
>> that nature in mainline would be pretty costly to us, and rather
>> unwelcome.
> 
> I originally implemented c/r for linux as a kernel module, without
> requiring any changes to the kernel. (Doing the namespaces as a kernel
> module was much harder). For more details, see:
> 	https://www.ncl.cs.columbia.edu/research/migrate
oops... I meant the following link:
	http://www.ncl.cs.columbia.edu/research/migration/
see, for example, the papers from DejaView (SOSP 07) and Zap (USENIX 07).
Oren.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-12 21:21                           ` Serge E. Hallyn
@ 2009-03-13  4:29                             ` Ying Han
  2009-03-13  5:34                               ` Sukadev Bhattiprolu
                                                 ` (2 more replies)
  0 siblings, 3 replies; 121+ messages in thread
From: Ying Han @ 2009-03-13  4:29 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Greg Kurz, Cedric Le Goater, Andrew Morton, linux-api, containers,
	mpm, linux-kernel, Dave Hansen, linux-mm, tglx, viro, hpa, mingo,
	torvalds, Alexey Dobriyan, xemul
Hi Serge:
I recently made a patch based on Oren's tree which implements a new
syscall, clone_with_pid. I tested it with checkpoint/restart of a process
tree and it works as expected.
This patch contains a hack: I made a copy of libc's clone and
modified it to pass one more argument (the pid number). I will
try to clean up the code and do more testing.
New syscall clone_with_pid
Implement a new syscall which clones a thread with a preselected pid number.
clone_with_pid(child_func, child_stack + CHILD_STACK - 16,
			CLONE_WITH_PID|SIGCHLD, pid, NULL);
Signed-off-by: Ying Han <yinghan@google.com>
diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 87803da..b5a1b03 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -26,6 +26,7 @@ asmlinkage int sys_fork(struct pt_regs);
 asmlinkage int sys_clone(struct pt_regs);
 asmlinkage int sys_vfork(struct pt_regs);
 asmlinkage int sys_execve(struct pt_regs);
+asmlinkage int sys_clone_with_pid(struct pt_regs);
 /* kernel/signal_32.c */
 asmlinkage int sys_sigsuspend(int, int, old_sigset_t);
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32
index a5f9e09..f10ca0e 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -340,6 +340,7 @@
 #define __NR_inotify_init1	332
 #define __NR_checkpoint		333
 #define __NR_restart		334
+#define __NR_clone_with_pid	335
 #ifdef __KERNEL__
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 0a1302f..88ae634 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -8,7 +8,6 @@
 /*
  * This file handles the architecture-dependent parts of process handling..
  */
-
 #include <stdarg.h>
 #include <linux/cpu.h>
@@ -652,6 +651,28 @@ asmlinkage int sys_clone(struct pt_regs regs)
 	return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
 }
+/**
+ * sys_clone_with_pid - clone a thread with a preselected pid number.
+ */
+asmlinkage int sys_clone_with_pid(struct pt_regs regs)
+{
+	unsigned long clone_flags;
+	unsigned long newsp;
+	int __user *parent_tidptr, *child_tidptr;
+	pid_t pid_nr;
+
+	clone_flags = regs.bx;
+	newsp = regs.cx;
+	parent_tidptr = (int __user *)regs.dx;
+	child_tidptr = (int __user *)regs.di;
+	pid_nr = regs.bp;
+
+	if (!newsp)
+		newsp = regs.sp;
+	return do_fork(clone_flags, newsp, &regs, pid_nr, parent_tidptr,
+			child_tidptr);
+}
+
 /*
  * This is trivial, and on the face of it looks like it
  * could equally well be done in user mode.
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_tabl
index 5543136..5191117 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -334,3 +334,4 @@ ENTRY(sys_call_table)
 	.long sys_inotify_init1
 	.long sys_checkpoint
 	.long sys_restart
+	.long sys_clone_with_pid
diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
index 50bde9a..a4aee65 100644
--- a/arch/x86/mm/checkpoint.c
+++ b/arch/x86/mm/checkpoint.c
@@ -7,7 +7,6 @@
  *  License.  See the file COPYING in the main directory of the Linux
  *  distribution for more details.
  */
-
 #include <asm/desc.h>
 #include <asm/i387.h>
diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
index 64155de..b7de611 100644
--- a/checkpoint/checkpoint.c
+++ b/checkpoint/checkpoint.c
@@ -8,6 +8,7 @@
  *  distribution for more details.
  */
+#define DEBUG
 #include <linux/version.h>
 #include <linux/sched.h>
 #include <linux/ptrace.h>
@@ -564,3 +565,4 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
  out:
 	return ret;
 }
+
diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
index e3097ac..a8c5ad5 100644
--- a/checkpoint/ckpt_file.c
+++ b/checkpoint/ckpt_file.c
@@ -7,7 +7,7 @@
  *  License.  See the file COPYING in the main directory of the Linux
  *  distribution for more details.
  */
-
+#define DEBUG
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/file.h>
diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
index 4925ff2..ca5840b 100644
--- a/checkpoint/ckpt_mem.c
+++ b/checkpoint/ckpt_mem.c
@@ -7,7 +7,7 @@
  *  License.  See the file COPYING in the main directory of the Linux
  *  distribution for more details.
  */
-
+#define DEBUG
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
diff --git a/checkpoint/restart.c b/checkpoint/restart.c
index 7ec4de4..30e43c2 100644
--- a/checkpoint/restart.c
+++ b/checkpoint/restart.c
@@ -8,6 +8,7 @@
  *  distribution for more details.
  */
+#define DEBUG
 #include <linux/version.h>
 #include <linux/sched.h>
 #include <linux/wait.h>
@@ -242,7 +243,7 @@ static int cr_read_task_struct(struct cr_ctx *ctx)
 		memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
 	}
 	kfree(buf);
-
+	pr_debug("read task %s\n", t->comm);
 	/* FIXME: restore remaining relevant task_struct fields */
  out:
 	cr_hbuf_put(ctx, sizeof(*hh));
diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
index f44b081..755e40e 100644
--- a/checkpoint/rstr_file.c
+++ b/checkpoint/rstr_file.c
@@ -7,7 +7,7 @@
  *  License.  See the file COPYING in the main directory of the Linux
  *  distribution for more details.
  */
-
+#define DEBUG
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/fs.h>
diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
index 4d5ce1a..8330468 100644
--- a/checkpoint/rstr_mem.c
+++ b/checkpoint/rstr_mem.c
@@ -7,7 +7,7 @@
  *  License.  See the file COPYING in the main directory of the Linux
  *  distribution for more details.
  */
-
+#define DEBUG
 #include <linux/kernel.h>
 #include <linux/sched.h>
 #include <linux/fcntl.h>
diff --git a/checkpoint/sys.c b/checkpoint/sys.c
index f26b0c6..d1a5394 100644
--- a/checkpoint/sys.c
+++ b/checkpoint/sys.c
@@ -7,7 +7,7 @@
  *  License.  See the file COPYING in the main directory of the Linux
  *  distribution for more details.
  */
-
+#define DEBUG
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/kernel.h>
@@ -263,7 +263,6 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned
 		return PTR_ERR(ctx);
 	ret = do_checkpoint(ctx, pid);
-
 	if (!ret)
 		ret = ctx->crid;
@@ -304,3 +303,4 @@ asmlinkage long sys_restart(int crid, int fd, unsigned lon
 	cr_ctx_put(ctx);
 	return ret;
 }
+
diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 217cf6e..bc2c202 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -114,7 +114,6 @@ extern int cr_write_files(struct cr_ctx *ctx, struct task_
 extern int do_restart(struct cr_ctx *ctx, pid_t pid);
 extern int cr_read_mm(struct cr_ctx *ctx);
 extern int cr_read_files(struct cr_ctx *ctx);
-
 #ifdef pr_fmt
 #undef pr_fmt
 #endif
diff --git a/include/linux/pid.h b/include/linux/pid.h
index d7e98ff..86e2f61 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t pid_nr);
 extern void free_pid(struct pid *pid);
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0150e90..7fb4e28 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -28,6 +28,7 @@
 #define CLONE_NEWPID		0x20000000	/* New pid namespace */
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
+#define CLONE_WITH_PID		0x00001000	/* Clone with pre-select PID */
 /*
  * Scheduling policies
diff --git a/kernel/exit.c b/kernel/exit.c
index 2d8be7e..4baf651 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -3,7 +3,7 @@
  *
  *  Copyright (C) 1991, 1992  Linus Torvalds
  */
-
+#define DEBUG
 #include <linux/mm.h>
 #include <linux/slab.h>
 #include <linux/interrupt.h>
@@ -1676,6 +1676,7 @@ static long do_wait(enum pid_type type, struct pid *pid,
 	DECLARE_WAITQUEUE(wait, current);
 	struct task_struct *tsk;
 	int retval;
+	int level;
 	trace_sched_process_wait(pid);
@@ -1708,7 +1709,6 @@ repeat:
 			retval = tsk_result;
 			goto end;
 		}
-
 		if (options & __WNOTHREAD)
 			break;
 		tsk = next_thread(tsk);
@@ -1817,7 +1817,6 @@ asmlinkage long sys_wait4(pid_t upid, int __user *stat_a
 		type = PIDTYPE_PID;
 		pid = find_get_pid(upid);
 	}
-
 	ret = do_wait(type, pid, options | WEXITED, NULL, stat_addr, ru);
 	put_pid(pid);
diff --git a/kernel/fork.c b/kernel/fork.c
index 085ce56..262ae1e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -10,7 +10,7 @@
  * Fork is rather simple, once you get the hang of it, but the memory
  * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
  */
-
+#define DEBUG
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/unistd.h>
@@ -959,10 +959,19 @@ static struct task_struct *copy_process(unsigned long cl
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t clone_pid = stack_size;
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
+	/* We only allow the clone_with_pid when a new pid namespace is
+	 * created. FIXME: how to restrict it.
+	 */
+	if ((clone_flags & CLONE_NEWPID) && (clone_flags & CLONE_WITH_PID))
+		return ERR_PTR(-EINVAL);
+	if ((clone_flags & CLONE_WITH_PID) && (clone_pid <= 1))
+		return ERR_PTR(-EINVAL);
+
 	/*
 	 * Thread groups must share signals as well, and detached threads
 	 * can only be started up within the thread group.
@@ -1135,7 +1144,10 @@ static struct task_struct *copy_process(unsigned long c
 	if (pid != &init_struct_pid) {
 		retval = -ENOMEM;
-		pid = alloc_pid(task_active_pid_ns(p));
+		if (clone_flags & CLONE_WITH_PID)
+			pid = alloc_pid(task_active_pid_ns(p), clone_pid);
+		else
+			pid = alloc_pid(task_active_pid_ns(p), 0);
 		if (!pid)
 			goto bad_fork_cleanup_io;
@@ -1162,6 +1174,8 @@ static struct task_struct *copy_process(unsigned long cl
 	 * Clear TID on mm_release()?
 	 */
 	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr: NU
+
+
 #ifdef CONFIG_FUTEX
 	p->robust_list = NULL;
 #ifdef CONFIG_COMPAT
diff --git a/kernel/pid.c b/kernel/pid.c
index 064e76a..0facf05 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -25,7 +25,7 @@
  *     Many thanks to Oleg Nesterov for comments and help
  *
  */
-
+#define DEBUG
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/slab.h>
@@ -122,12 +122,15 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t pid_nr)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	struct pidmap *map;
-	pid = last + 1;
+	if (pid_nr)
+		pid = pid_nr;
+	else
+		pid = last + 1;
 	if (pid >= pid_max)
 		pid = RESERVED_PIDS;
 	offset = pid & BITS_PER_PAGE_MASK;
@@ -153,9 +156,12 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
+					if (!pid_nr)
+						pid_ns->last_pid = pid;
 					return pid;
 				}
+				if (pid_nr)
+					return -1;
 				offset = find_next_offset(map, offset);
 				pid = mk_pid(pid_ns, map, offset);
 			/*
@@ -239,21 +245,25 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t pid_nr)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	int level = ns->level;
+
+	if (pid_nr >= pid_max)
+		return NULL;
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid)
 		goto out;
-	tmp = ns;
-	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+	tmp = ns->parent;
+	for (i = level-1; i >= 0; i--) {
+		nr = alloc_pidmap(tmp, 0);
 		if (nr < 0)
 			goto out_free;
@@ -262,6 +272,14 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 		tmp = tmp->parent;
 	}
+	nr = alloc_pidmap(ns, pid_nr);
+	if (nr < 0)
+		goto out_free;
+	pid->numbers[level].nr = nr;
+	pid->numbers[level].ns = ns;
+	if (nr == pid_nr)
+		pr_debug("nr == pid_nr == %d\n", nr);
+
 	get_pid_ns(ns);
 	pid->level = ns->level;
 	atomic_set(&pid->count, 1);
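The alloc_pid() change above can be read as: ancestor namespaces get whatever pid comes next, and only the task's own namespace honours the requested number. A userspace sketch of that idea follows. The names toy_ns, toy_pid and toy_alloc_pid_tuple() are hypothetical, collision checks are omitted (the sketch assumes the requested pid is free), and a plain counter stands in for the pid bitmap.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LEVEL 4	/* assumption: shallow nesting is enough for a demo */

struct toy_ns {
	int level;              /* 0 = init namespace */
	int last_pid;           /* last pid handed out sequentially here */
	struct toy_ns *parent;  /* NULL for level 0 */
};

struct toy_pid {
	int level;
	int numbers[TOY_MAX_LEVEL + 1]; /* numbers[i] = pid seen at level i */
};

/*
 * Fill in one number per namespace level. Ancestors get the next
 * sequential pid; the task's own namespace gets 'want', or the next
 * sequential pid when want == 0. Returns 0 on success, -1 on bad input.
 */
static int toy_alloc_pid_tuple(struct toy_ns *ns, int want,
			       struct toy_pid *pid)
{
	struct toy_ns *tmp;

	if (ns->level > TOY_MAX_LEVEL || want < 0)
		return -1;

	pid->level = ns->level;
	for (tmp = ns->parent; tmp; tmp = tmp->parent)
		pid->numbers[tmp->level] = ++tmp->last_pid;

	/* as in the patch, a requested pid does not advance last_pid */
	pid->numbers[ns->level] = want ? want : ++ns->last_pid;
	return 0;
}
```

With an init namespace at last_pid 5593 and a child namespace at last_pid 1, asking for pid 3 in the child yields the tuple (5594,3): the outer number is whatever the allocator produced, the inner one is the restored value.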
On Thu, Mar 12, 2009 at 2:21 PM, Serge E. Hallyn <serue@us.ibm.com> wrote:
>
> Quoting Greg Kurz (gkurz@fr.ibm.com):
> > On Thu, 2009-03-12 at 09:53 -0500, Serge E. Hallyn wrote:
> > > Or are you suggesting that you'll do a dummy clone of (5594,2) so that
> > > the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?
> > >
> >
> > Of course not
>
> Ok - someone *did* argue that at some point I think...
>
> > but one should be able to tell clone() to pick a specific
> > pid.
>
> Can you explain exactly how?  I must be missing something clever.
>
> -serge
>
^ permalink raw reply related	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13  4:29                             ` Ying Han
@ 2009-03-13  5:34                               ` Sukadev Bhattiprolu
       [not found]                                 ` <20090313053458.GA28833-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-03-13 17:27                                 ` Linus Torvalds
       [not found]                               ` <604427e00903122129y37ad791aq5fe7ef2552415da9-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-03-13 17:37                               ` Serge E. Hallyn
  2 siblings, 2 replies; 121+ messages in thread
From: Sukadev Bhattiprolu @ 2009-03-13  5:34 UTC (permalink / raw)
  To: Ying Han
  Cc: Serge E. Hallyn, linux-api, containers, hpa, linux-kernel,
	Dave Hansen, linux-mm, viro, mingo, mpm, Andrew Morton, xemul,
	torvalds, tglx, Alexey Dobriyan
Ying Han [yinghan@google.com] wrote:
| Hi Serge:
| I made a patch based on Oren's tree recently which implement a new
| syscall clone_with_pid. I tested with checkpoint/restart process tree
| and it works as expected.
Yes, I think we had a version of clone() with pid a while ago.
But it would be easier to review if you broke it up into smaller
patches and removed the unnecessary diffs in this patch, like...
| This patch contains a hack: I made a copy of libc's clone and
| modified it to pass one more argument (the pid number). I will
| try to clean up the code and do more testing.
| 
| New syscall clone_with_pid
| Implement a new syscall which clones a thread with a preselected pid number.
| 
| clone_with_pid(child_func, child_stack + CHILD_STACK - 16,
| 			CLONE_WITH_PID|SIGCHLD, pid, NULL);
| 
| Signed-off-by: Ying Han <yinghan@google.com>
| 
| diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
| index 87803da..b5a1b03 100644
| --- a/arch/x86/include/asm/syscalls.h
| +++ b/arch/x86/include/asm/syscalls.h
| @@ -26,6 +26,7 @@ asmlinkage int sys_fork(struct pt_regs);
|  asmlinkage int sys_clone(struct pt_regs);
|  asmlinkage int sys_vfork(struct pt_regs);
|  asmlinkage int sys_execve(struct pt_regs);
| +asmlinkage int sys_clone_with_pid(struct pt_regs);
| 
|  /* kernel/signal_32.c */
|  asmlinkage int sys_sigsuspend(int, int, old_sigset_t);
| diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32
| index a5f9e09..f10ca0e 100644
| --- a/arch/x86/include/asm/unistd_32.h
| +++ b/arch/x86/include/asm/unistd_32.h
| @@ -340,6 +340,7 @@
|  #define __NR_inotify_init1	332
|  #define __NR_checkpoint		333
|  #define __NR_restart		334
| +#define __NR_clone_with_pid	335
| 
|  #ifdef __KERNEL__
| 
| diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
| index 0a1302f..88ae634 100644
| --- a/arch/x86/kernel/process_32.c
| +++ b/arch/x86/kernel/process_32.c
| @@ -8,7 +8,6 @@
|  /*
|   * This file handles the architecture-dependent parts of process handling..
|   */
| -
these
|  #include <stdarg.h>
| 
|  #include <linux/cpu.h>
| @@ -652,6 +651,28 @@ asmlinkage int sys_clone(struct pt_regs regs)
|  	return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
|  }
| 
| +/**
| + * sys_clone_with_pid - clone a thread with pre-select pid number.
| + */
| +asmlinkage int sys_clone_with_pid(struct pt_regs regs)
| +{
| +	unsigned long clone_flags;
| +	unsigned long newsp;
| +	int __user *parent_tidptr, *child_tidptr;
| +	pid_t pid_nr;
| +
| +	clone_flags = regs.bx;
| +	newsp = regs.cx;
| +	parent_tidptr = (int __user *)regs.dx;
| +	child_tidptr = (int __user *)regs.di;
| +	pid_nr = regs.bp;
| +
| +	if (!newsp)
| +		newsp = regs.sp;
| +	return do_fork(clone_flags, newsp, &regs, pid_nr, parent_tidptr,
| +			child_tidptr);
| +}
| +
|  /*
|   * This is trivial, and on the face of it looks like it
|   * could equally well be done in user mode.
| diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_tabl
| index 5543136..5191117 100644
| --- a/arch/x86/kernel/syscall_table_32.S
| +++ b/arch/x86/kernel/syscall_table_32.S
| @@ -334,3 +334,4 @@ ENTRY(sys_call_table)
|  	.long sys_inotify_init1
|  	.long sys_checkpoint
|  	.long sys_restart
| +	.long sys_clone_with_pid
| diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
| index 50bde9a..a4aee65 100644
| --- a/arch/x86/mm/checkpoint.c
| +++ b/arch/x86/mm/checkpoint.c
| @@ -7,7 +7,6 @@
|   *  License.  See the file COPYING in the main directory of the Linux
|   *  distribution for more details.
|   */
| -
|  #include <asm/desc.h>
|  #include <asm/i387.h>
| 
| diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
| index 64155de..b7de611 100644
| --- a/checkpoint/checkpoint.c
| +++ b/checkpoint/checkpoint.c
| @@ -8,6 +8,7 @@
|   *  distribution for more details.
|   */
| 
| +#define DEBUG
|  #include <linux/version.h>
|  #include <linux/sched.h>
|  #include <linux/ptrace.h>
| @@ -564,3 +565,4 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
|   out:
|  	return ret;
|  }
| +
| diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
| index e3097ac..a8c5ad5 100644
| --- a/checkpoint/ckpt_file.c
| +++ b/checkpoint/ckpt_file.c
| @@ -7,7 +7,7 @@
|   *  License.  See the file COPYING in the main directory of the Linux
|   *  distribution for more details.
|   */
| -
| +#define DEBUG
|  #include <linux/kernel.h>
|  #include <linux/sched.h>
|  #include <linux/file.h>
| diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
| index 4925ff2..ca5840b 100644
| --- a/checkpoint/ckpt_mem.c
| +++ b/checkpoint/ckpt_mem.c
| @@ -7,7 +7,7 @@
|   *  License.  See the file COPYING in the main directory of the Linux
|   *  distribution for more details.
|   */
| -
| +#define DEBUG
|  #include <linux/kernel.h>
|  #include <linux/sched.h>
|  #include <linux/slab.h>
| diff --git a/checkpoint/restart.c b/checkpoint/restart.c
| index 7ec4de4..30e43c2 100644
| --- a/checkpoint/restart.c
| +++ b/checkpoint/restart.c
| @@ -8,6 +8,7 @@
|   *  distribution for more details.
|   */
| 
| +#define DEBUG
|  #include <linux/version.h>
|  #include <linux/sched.h>
|  #include <linux/wait.h>
| @@ -242,7 +243,7 @@ static int cr_read_task_struct(struct cr_ctx *ctx)
|  		memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
|  	}
|  	kfree(buf);
| -
| +	pr_debug("read task %s\n", t->comm);
|  	/* FIXME: restore remaining relevant task_struct fields */
|   out:
|  	cr_hbuf_put(ctx, sizeof(*hh));
| diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
| index f44b081..755e40e 100644
| --- a/checkpoint/rstr_file.c
| +++ b/checkpoint/rstr_file.c
| @@ -7,7 +7,7 @@
|   *  License.  See the file COPYING in the main directory of the Linux
|   *  distribution for more details.
|   */
| -
| +#define DEBUG
|  #include <linux/kernel.h>
|  #include <linux/sched.h>
|  #include <linux/fs.h>
| diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
| index 4d5ce1a..8330468 100644
| --- a/checkpoint/rstr_mem.c
| +++ b/checkpoint/rstr_mem.c
| @@ -7,7 +7,7 @@
|   *  License.  See the file COPYING in the main directory of the Linux
|   *  distribution for more details.
|   */
| -
| +#define DEBUG
|  #include <linux/kernel.h>
|  #include <linux/sched.h>
|  #include <linux/fcntl.h>
| diff --git a/checkpoint/sys.c b/checkpoint/sys.c
| index f26b0c6..d1a5394 100644
| --- a/checkpoint/sys.c
| +++ b/checkpoint/sys.c
| @@ -7,7 +7,7 @@
|   *  License.  See the file COPYING in the main directory of the Linux
|   *  distribution for more details.
|   */
| -
| +#define DEBUG
|  #include <linux/sched.h>
|  #include <linux/nsproxy.h>
|  #include <linux/kernel.h>
| @@ -263,7 +263,6 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned
|  		return PTR_ERR(ctx);
| 
|  	ret = do_checkpoint(ctx, pid);
| -
|  	if (!ret)
|  		ret = ctx->crid;
| 
| @@ -304,3 +303,4 @@ asmlinkage long sys_restart(int crid, int fd, unsigned lon
|  	cr_ctx_put(ctx);
|  	return ret;
|  }
| +
| diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
| index 217cf6e..bc2c202 100644
| --- a/include/linux/checkpoint.h
| +++ b/include/linux/checkpoint.h
| @@ -114,7 +114,6 @@ extern int cr_write_files(struct cr_ctx *ctx, struct task_
|  extern int do_restart(struct cr_ctx *ctx, pid_t pid);
|  extern int cr_read_mm(struct cr_ctx *ctx);
|  extern int cr_read_files(struct cr_ctx *ctx);
| -
|  #ifdef pr_fmt
|  #undef pr_fmt
|  #endif
| diff --git a/include/linux/pid.h b/include/linux/pid.h
| index d7e98ff..86e2f61 100644
| --- a/include/linux/pid.h
| +++ b/include/linux/pid.h
| @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
|  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
|  int next_pidmap(struct pid_namespace *pid_ns, int last);
| 
| -extern struct pid *alloc_pid(struct pid_namespace *ns);
| +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t pid_nr);
|  extern void free_pid(struct pid *pid);
| 
|  /*
| diff --git a/include/linux/sched.h b/include/linux/sched.h
| index 0150e90..7fb4e28 100644
| --- a/include/linux/sched.h
| +++ b/include/linux/sched.h
| @@ -28,6 +28,7 @@
|  #define CLONE_NEWPID		0x20000000	/* New pid namespace */
|  #define CLONE_NEWNET		0x40000000	/* New network namespace */
|  #define CLONE_IO		0x80000000	/* Clone io context */
| +#define CLONE_WITH_PID		0x00001000	/* Clone with pre-select PID */
| 
|  /*
|   * Scheduling policies
| diff --git a/kernel/exit.c b/kernel/exit.c
| index 2d8be7e..4baf651 100644
| --- a/kernel/exit.c
| +++ b/kernel/exit.c
| @@ -3,7 +3,7 @@
|   *
|   *  Copyright (C) 1991, 1992  Linus Torvalds
|   */
| -
| +#define DEBUG
|  #include <linux/mm.h>
|  #include <linux/slab.h>
|  #include <linux/interrupt.h>
| @@ -1676,6 +1676,7 @@ static long do_wait(enum pid_type type, struct pid *pid,
|  	DECLARE_WAITQUEUE(wait, current);
|  	struct task_struct *tsk;
|  	int retval;
| +	int level;
and this (level is not used).
| 
|  	trace_sched_process_wait(pid);
| 
| @@ -1708,7 +1709,6 @@ repeat:
|  			retval = tsk_result;
|  			goto end;
|  		}
| -
|  		if (options & __WNOTHREAD)
|  			break;
|  		tsk = next_thread(tsk);
| @@ -1817,7 +1817,6 @@ asmlinkage long sys_wait4(pid_t upid, int __user *stat_a
|  		type = PIDTYPE_PID;
|  		pid = find_get_pid(upid);
|  	}
| -
|  	ret = do_wait(type, pid, options | WEXITED, NULL, stat_addr, ru);
|  	put_pid(pid);
| 
| diff --git a/kernel/fork.c b/kernel/fork.c
| index 085ce56..262ae1e 100644
| --- a/kernel/fork.c
| +++ b/kernel/fork.c
| @@ -10,7 +10,7 @@
|   * Fork is rather simple, once you get the hang of it, but the memory
|   * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
|   */
| -
| +#define DEBUG
|  #include <linux/slab.h>
|  #include <linux/init.h>
|  #include <linux/unistd.h>
| @@ -959,10 +959,19 @@ static struct task_struct *copy_process(unsigned long cl
|  	int retval;
|  	struct task_struct *p;
|  	int cgroup_callbacks_done = 0;
| +	pid_t clone_pid = stack_size;
| 
|  	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
|  		return ERR_PTR(-EINVAL);
| 
| +	/* We only allow the clone_with_pid when a new pid namespace is
| +	 * created. FIXME: how to restrict it.
Not sure why CLONE_NEWPID is required to set pid_nr. In fact with CLONE_NEWPID,
by definition, pid_nr should be 1. Also, what happens if a container has
more than one process - where the second process has a pid_nr > 2 ?
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                                 ` <20090313053458.GA28833-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-03-13  6:19                                   ` Ying Han
  0 siblings, 0 replies; 121+ messages in thread
From: Ying Han @ 2009-03-13  6:19 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Serge E. Hallyn, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mingo-X9Un+BFzKDI,
	mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ, Andrew Morton,
	xemul-GEFAQzZX7r8dnm+yROfE0A,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, Alexey Dobriyan
Thank you Sukadev for your comments. I will try to clean up my patch
and repost it.
--Ying
On Thu, Mar 12, 2009 at 10:34 PM, Sukadev Bhattiprolu
<sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> Ying Han [yinghan-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org] wrote:
> | Hi Serge:
> | I made a patch based on Oren's tree recently which implements a new
> | syscall clone_with_pid. I tested with checkpoint/restart process tree
> | and it works as expected.
>
> Yes, I think we had a version of clone() with pid a while ago.
>
> But it would be easier to review if you broke it up into smaller
> patches and removed the unnecessary diffs in this patch, like...
>
>
> | This patch contains a hack: I made a copy of libc's clone and
> | modified it to pass one more argument (the pid number). I will
> | try to clean up the code and do more testing.
> |
> | New syscall clone_with_pid
> | Implement a new syscall which clones a thread with a preselected pid number.
> |
> | clone_with_pid(child_func, child_stack + CHILD_STACK - 16,
> |                       CLONE_WITH_PID|SIGCHLD, pid, NULL);
> |
> | Signed-off-by: Ying Han <yinghan-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> |
> | diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
> | index 87803da..b5a1b03 100644
> | --- a/arch/x86/include/asm/syscalls.h
> | +++ b/arch/x86/include/asm/syscalls.h
> | @@ -26,6 +26,7 @@ asmlinkage int sys_fork(struct pt_regs);
> |  asmlinkage int sys_clone(struct pt_regs);
> |  asmlinkage int sys_vfork(struct pt_regs);
> |  asmlinkage int sys_execve(struct pt_regs);
> | +asmlinkage int sys_clone_with_pid(struct pt_regs);
> |
> |  /* kernel/signal_32.c */
> |  asmlinkage int sys_sigsuspend(int, int, old_sigset_t);
> | diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32
> | index a5f9e09..f10ca0e 100644
> | --- a/arch/x86/include/asm/unistd_32.h
> | +++ b/arch/x86/include/asm/unistd_32.h
> | @@ -340,6 +340,7 @@
> |  #define __NR_inotify_init1   332
> |  #define __NR_checkpoint              333
> |  #define __NR_restart         334
> | +#define __NR_clone_with_pid  335
> |
> |  #ifdef __KERNEL__
> |
> | diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> | index 0a1302f..88ae634 100644
> | --- a/arch/x86/kernel/process_32.c
> | +++ b/arch/x86/kernel/process_32.c
> | @@ -8,7 +8,6 @@
> |  /*
> |   * This file handles the architecture-dependent parts of process handling..
> |   */
> | -
>
> these
>
> |  #include <stdarg.h>
> |
> |  #include <linux/cpu.h>
> | @@ -652,6 +651,28 @@ asmlinkage int sys_clone(struct pt_regs regs)
> |       return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
> |  }
> |
> | +/**
> | + * sys_clone_with_pid - clone a thread with pre-select pid number.
> | + */
> | +asmlinkage int sys_clone_with_pid(struct pt_regs regs)
> | +{
> | +     unsigned long clone_flags;
> | +     unsigned long newsp;
> | +     int __user *parent_tidptr, *child_tidptr;
> | +     pid_t pid_nr;
> | +
> | +     clone_flags = regs.bx;
> | +     newsp = regs.cx;
> | +     parent_tidptr = (int __user *)regs.dx;
> | +     child_tidptr = (int __user *)regs.di;
> | +     pid_nr = regs.bp;
> | +
> | +     if (!newsp)
> | +             newsp = regs.sp;
> | +     return do_fork(clone_flags, newsp, &regs, pid_nr, parent_tidptr,
> | +                     child_tidptr);
> | +}
> | +
> |  /*
> |   * This is trivial, and on the face of it looks like it
> |   * could equally well be done in user mode.
> | diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_tabl
> | index 5543136..5191117 100644
> | --- a/arch/x86/kernel/syscall_table_32.S
> | +++ b/arch/x86/kernel/syscall_table_32.S
> | @@ -334,3 +334,4 @@ ENTRY(sys_call_table)
> |       .long sys_inotify_init1
> |       .long sys_checkpoint
> |       .long sys_restart
> | +     .long sys_clone_with_pid
> | diff --git a/arch/x86/mm/checkpoint.c b/arch/x86/mm/checkpoint.c
> | index 50bde9a..a4aee65 100644
> | --- a/arch/x86/mm/checkpoint.c
> | +++ b/arch/x86/mm/checkpoint.c
> | @@ -7,7 +7,6 @@
> |   *  License.  See the file COPYING in the main directory of the Linux
> |   *  distribution for more details.
> |   */
> | -
> |  #include <asm/desc.h>
> |  #include <asm/i387.h>
> |
> | diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
> | index 64155de..b7de611 100644
> | --- a/checkpoint/checkpoint.c
> | +++ b/checkpoint/checkpoint.c
> | @@ -8,6 +8,7 @@
> |   *  distribution for more details.
> |   */
> |
> | +#define DEBUG
> |  #include <linux/version.h>
> |  #include <linux/sched.h>
> |  #include <linux/ptrace.h>
> | @@ -564,3 +565,4 @@ int do_checkpoint(struct cr_ctx *ctx, pid_t pid)
> |   out:
> |       return ret;
> |  }
> | +
> | diff --git a/checkpoint/ckpt_file.c b/checkpoint/ckpt_file.c
> | index e3097ac..a8c5ad5 100644
> | --- a/checkpoint/ckpt_file.c
> | +++ b/checkpoint/ckpt_file.c
> | @@ -7,7 +7,7 @@
> |   *  License.  See the file COPYING in the main directory of the Linux
> |   *  distribution for more details.
> |   */
> | -
> | +#define DEBUG
> |  #include <linux/kernel.h>
> |  #include <linux/sched.h>
> |  #include <linux/file.h>
> | diff --git a/checkpoint/ckpt_mem.c b/checkpoint/ckpt_mem.c
> | index 4925ff2..ca5840b 100644
> | --- a/checkpoint/ckpt_mem.c
> | +++ b/checkpoint/ckpt_mem.c
> | @@ -7,7 +7,7 @@
> |   *  License.  See the file COPYING in the main directory of the Linux
> |   *  distribution for more details.
> |   */
> | -
> | +#define DEBUG
> |  #include <linux/kernel.h>
> |  #include <linux/sched.h>
> |  #include <linux/slab.h>
> | diff --git a/checkpoint/restart.c b/checkpoint/restart.c
> | index 7ec4de4..30e43c2 100644
> | --- a/checkpoint/restart.c
> | +++ b/checkpoint/restart.c
> | @@ -8,6 +8,7 @@
> |   *  distribution for more details.
> |   */
> |
> | +#define DEBUG
> |  #include <linux/version.h>
> |  #include <linux/sched.h>
> |  #include <linux/wait.h>
> | @@ -242,7 +243,7 @@ static int cr_read_task_struct(struct cr_ctx *ctx)
> |               memcpy(t->comm, buf, min(hh->task_comm_len, TASK_COMM_LEN));
> |       }
> |       kfree(buf);
> | -
> | +     pr_debug("read task %s\n", t->comm);
> |       /* FIXME: restore remaining relevant task_struct fields */
> |   out:
> |       cr_hbuf_put(ctx, sizeof(*hh));
> | diff --git a/checkpoint/rstr_file.c b/checkpoint/rstr_file.c
> | index f44b081..755e40e 100644
> | --- a/checkpoint/rstr_file.c
> | +++ b/checkpoint/rstr_file.c
> | @@ -7,7 +7,7 @@
> |   *  License.  See the file COPYING in the main directory of the Linux
> |   *  distribution for more details.
> |   */
> | -
> | +#define DEBUG
> |  #include <linux/kernel.h>
> |  #include <linux/sched.h>
> |  #include <linux/fs.h>
> | diff --git a/checkpoint/rstr_mem.c b/checkpoint/rstr_mem.c
> | index 4d5ce1a..8330468 100644
> | --- a/checkpoint/rstr_mem.c
> | +++ b/checkpoint/rstr_mem.c
> | @@ -7,7 +7,7 @@
> |   *  License.  See the file COPYING in the main directory of the Linux
> |   *  distribution for more details.
> |   */
> | -
> | +#define DEBUG
> |  #include <linux/kernel.h>
> |  #include <linux/sched.h>
> |  #include <linux/fcntl.h>
> | diff --git a/checkpoint/sys.c b/checkpoint/sys.c
> | index f26b0c6..d1a5394 100644
> | --- a/checkpoint/sys.c
> | +++ b/checkpoint/sys.c
> | @@ -7,7 +7,7 @@
> |   *  License.  See the file COPYING in the main directory of the Linux
> |   *  distribution for more details.
> |   */
> | -
> | +#define DEBUG
> |  #include <linux/sched.h>
> |  #include <linux/nsproxy.h>
> |  #include <linux/kernel.h>
> | @@ -263,7 +263,6 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned
> |               return PTR_ERR(ctx);
> |
> |       ret = do_checkpoint(ctx, pid);
> | -
> |       if (!ret)
> |               ret = ctx->crid;
> |
> | @@ -304,3 +303,4 @@ asmlinkage long sys_restart(int crid, int fd, unsigned lon
> |       cr_ctx_put(ctx);
> |       return ret;
> |  }
> | +
> | diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
> | index 217cf6e..bc2c202 100644
> | --- a/include/linux/checkpoint.h
> | +++ b/include/linux/checkpoint.h
> | @@ -114,7 +114,6 @@ extern int cr_write_files(struct cr_ctx *ctx, struct task_
> |  extern int do_restart(struct cr_ctx *ctx, pid_t pid);
> |  extern int cr_read_mm(struct cr_ctx *ctx);
> |  extern int cr_read_files(struct cr_ctx *ctx);
> | -
> |  #ifdef pr_fmt
> |  #undef pr_fmt
> |  #endif
> | diff --git a/include/linux/pid.h b/include/linux/pid.h
> | index d7e98ff..86e2f61 100644
> | --- a/include/linux/pid.h
> | +++ b/include/linux/pid.h
> | @@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
> |  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
> |  int next_pidmap(struct pid_namespace *pid_ns, int last);
> |
> | -extern struct pid *alloc_pid(struct pid_namespace *ns);
> | +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t pid_nr);
> |  extern void free_pid(struct pid *pid);
> |
> |  /*
> | diff --git a/include/linux/sched.h b/include/linux/sched.h
> | index 0150e90..7fb4e28 100644
> | --- a/include/linux/sched.h
> | +++ b/include/linux/sched.h
> | @@ -28,6 +28,7 @@
> |  #define CLONE_NEWPID         0x20000000      /* New pid namespace */
> |  #define CLONE_NEWNET         0x40000000      /* New network namespace */
> |  #define CLONE_IO             0x80000000      /* Clone io context */
> | +#define CLONE_WITH_PID               0x00001000      /* Clone with pre-select PID */
> |
> |  /*
> |   * Scheduling policies
> | diff --git a/kernel/exit.c b/kernel/exit.c
> | index 2d8be7e..4baf651 100644
> | --- a/kernel/exit.c
> | +++ b/kernel/exit.c
> | @@ -3,7 +3,7 @@
> |   *
> |   *  Copyright (C) 1991, 1992  Linus Torvalds
> |   */
> | -
> | +#define DEBUG
> |  #include <linux/mm.h>
> |  #include <linux/slab.h>
> |  #include <linux/interrupt.h>
> | @@ -1676,6 +1676,7 @@ static long do_wait(enum pid_type type, struct pid *pid,
> |       DECLARE_WAITQUEUE(wait, current);
> |       struct task_struct *tsk;
> |       int retval;
> | +     int level;
>
> and this (level is not used).
> |
> |       trace_sched_process_wait(pid);
> |
> | @@ -1708,7 +1709,6 @@ repeat:
> |                       retval = tsk_result;
> |                       goto end;
> |               }
> | -
> |               if (options & __WNOTHREAD)
> |                       break;
> |               tsk = next_thread(tsk);
> | @@ -1817,7 +1817,6 @@ asmlinkage long sys_wait4(pid_t upid, int __user *stat_a
> |               type = PIDTYPE_PID;
> |               pid = find_get_pid(upid);
> |       }
> | -
> |       ret = do_wait(type, pid, options | WEXITED, NULL, stat_addr, ru);
> |       put_pid(pid);
> |
> | diff --git a/kernel/fork.c b/kernel/fork.c
> | index 085ce56..262ae1e 100644
> | --- a/kernel/fork.c
> | +++ b/kernel/fork.c
> | @@ -10,7 +10,7 @@
> |   * Fork is rather simple, once you get the hang of it, but the memory
> |   * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
> |   */
> | -
> | +#define DEBUG
> |  #include <linux/slab.h>
> |  #include <linux/init.h>
> |  #include <linux/unistd.h>
> | @@ -959,10 +959,19 @@ static struct task_struct *copy_process(unsigned long cl
> |       int retval;
> |       struct task_struct *p;
> |       int cgroup_callbacks_done = 0;
> | +     pid_t clone_pid = stack_size;
> |
> |       if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
> |               return ERR_PTR(-EINVAL);
> |
> | +     /* We only allow the clone_with_pid when a new pid namespace is
> | +      * created. FIXME: how to restrict it.
>
> Not sure why CLONE_NEWPID is required to set pid_nr. In fact with CLONE_NEWPID,
> by definition, pid_nr should be 1. Also, what happens if a container has
> more than one process - where the second process has a pid_nr > 2 ?
>
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                               ` <604427e00903122129y37ad791aq5fe7ef2552415da9-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-03-13 15:27                                 ` Cedric Le Goater
       [not found]                                   ` <49BA7B60.60607-GANU6spQydw@public.gmane.org>
  0 siblings, 1 reply; 121+ messages in thread
From: Cedric Le Goater @ 2009-03-13 15:27 UTC (permalink / raw)
  To: Ying Han
  Cc: Serge E. Hallyn, Greg Kurz, Andrew Morton,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	tglx-hfZtesqFncYOwBW4kG4KsQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	mingo-X9Un+BFzKDI, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Alexey Dobriyan, xemul-GEFAQzZX7r8dnm+yROfE0A
[-- Attachment #1: Type: text/plain, Size: 5265 bytes --]
Ying Han wrote:
> Hi Serge:
> I made a patch based on Oren's tree recently which implements a new
> syscall clone_with_pid. I tested with checkpoint/restart process tree
> and it works as expected.
> This patch contains a hack: I made a copy of libc's clone and
> modified it to pass one more argument (the pid number). I will
> try to clean up the code and do more testing.
OK. Two patches would also be interesting: one for the syscall and one
for the kernel support (which might be acceptable).
> New syscall clone_with_pid
> Implement a new syscall which clones a thread with a preselected pid number.
Yes, this is definitely needed to restart a task/thread. We maintain an ugly
hack which stores a pid in the current task that will be used by the next
clone() call.
> clone_with_pid(child_func, child_stack + CHILD_STACK - 16,
> 			CLONE_WITH_PID|SIGCHLD, pid, NULL);
Since you're introducing a new syscall, I don't see why you also need a
CLONE_WITH_PID flag?
(FYI, attached is my old attempt at clone_with_pid, but it is too old.)
[ ... ]
> +#define DEBUG
>  #include <linux/slab.h>
>  #include <linux/init.h>
>  #include <linux/unistd.h>
> @@ -959,10 +959,19 @@ static struct task_struct *copy_process(unsigned long cl
>  	int retval;
>  	struct task_struct *p;
>  	int cgroup_callbacks_done = 0;
> +	pid_t clone_pid = stack_size;
> 
>  	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
>  		return ERR_PTR(-EINVAL);
> 
> +	/* We only allow the clone_with_pid when a new pid namespace is
> +	 * created. FIXME: how to restrict it.
> +	 */
> +	if ((clone_flags & CLONE_NEWPID) && (clone_flags & CLONE_WITH_PID))
> +		return ERR_PTR(-EINVAL);
> +	if ((clone_flags & CLONE_WITH_PID) && (clone_pid <= 1))
> +		return ERR_PTR(-EINVAL);
I would let alloc_pid() handle the error.
>  	/*
>  	 * Thread groups must share signals as well, and detached threads
>  	 * can only be started up within the thread group.
> @@ -1135,7 +1144,10 @@ static struct task_struct *copy_process(unsigned long c
> 
>  	if (pid != &init_struct_pid) {
>  		retval = -ENOMEM;
> -		pid = alloc_pid(task_active_pid_ns(p));
> +		if (clone_flags & CLONE_WITH_PID)
> +			pid = alloc_pid(task_active_pid_ns(p), clone_pid);
> +		else
> +			pid = alloc_pid(task_active_pid_ns(p), 0);
this is overkill IMO.
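One possible reading of the "overkill" remark, sketched in userspace C: if every caller that does not request a pid already passes 0 (as sys_clone() does in the quoted patch), the branch on CLONE_WITH_PID adds nothing and a single call is enough. Everything prefixed sketch_ is a hypothetical stand-in, not kernel code, and the value 100 merely represents "the next free pid".

```c
#include <assert.h>

#define SKETCH_CLONE_WITH_PID 0x00001000UL /* flag value copied from the quoted patch */

/* Hypothetical stand-in for alloc_pid(): 0 means "no preference". */
static int sketch_alloc_pid(int pid_nr)
{
	assert(pid_nr >= 0);
	return pid_nr ? pid_nr : 100; /* 100 stands in for "next free pid" */
}

/* Branching form, mirroring the quoted hunk. */
static int sketch_branching(unsigned long clone_flags, int clone_pid)
{
	if (clone_flags & SKETCH_CLONE_WITH_PID)
		return sketch_alloc_pid(clone_pid);
	return sketch_alloc_pid(0);
}

/* Single-call form: relies on callers passing 0 when no pid is requested. */
static int sketch_single_call(int clone_pid)
{
	return sketch_alloc_pid(clone_pid);
}
```

The two forms only differ when the flag is clear but clone_pid is non-zero, so the single call is safe only if that combination cannot reach copy_process().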
[ ... ]
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> +static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t pid_nr)
>  {
>  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
>  	struct pidmap *map;
> 
> -	pid = last + 1;
> +	if (pid_nr)
> +		pid = pid_nr;
> +	else
> +		pid = last + 1;
>
>  	if (pid >= pid_max)
>  		pid = RESERVED_PIDS;
>  	offset = pid & BITS_PER_PAGE_MASK;
> @@ -153,9 +156,12 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
>  			do {
>  				if (!test_and_set_bit(offset, map->page)) {
>  					atomic_dec(&map->nr_free);
> -					pid_ns->last_pid = pid;
> +					if (!pid_nr)
> +						pid_ns->last_pid = pid;
>  					return pid;
>  				}
> +				if (pid_nr)
> +					return -1;
>  				offset = find_next_offset(map, offset);
>  				pid = mk_pid(pid_ns, map, offset);
>  			/*
> @@ -239,21 +245,25 @@ void free_pid(struct pid *pid)
>  	call_rcu(&pid->rcu, delayed_put_pid);
>  }
> 
> -struct pid *alloc_pid(struct pid_namespace *ns)
> +struct pid *alloc_pid(struct pid_namespace *ns, pid_t pid_nr)
>  {
>  	struct pid *pid;
>  	enum pid_type type;
>  	int i, nr;
>  	struct pid_namespace *tmp;
>  	struct upid *upid;
> +	int level = ns->level;
> +
> +	if (pid_nr >= pid_max)
> +		return NULL;
let alloc_pidmap() handle it ? 
> 
>  	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
>  	if (!pid)
>  		goto out;
> 
> -	tmp = ns;
> -	for (i = ns->level; i >= 0; i--) {
> -		nr = alloc_pidmap(tmp);
> +	tmp = ns->parent;
> +	for (i = level-1; i >= 0; i--) {
> +		nr = alloc_pidmap(tmp, 0);
>  		if (nr < 0)
>  			goto out_free;
> 
> @@ -262,6 +272,14 @@ struct pid *alloc_pid(struct pid_namespace *ns)
>  		tmp = tmp->parent;
>  	}
> 
> +	nr = alloc_pidmap(ns, pid_nr);
> +	if (nr < 0)
> +		goto out_free;
> +	pid->numbers[level].nr = nr;
> +	pid->numbers[level].ns = ns;
> +	if (nr == pid_nr)
> +		pr_debug("nr == pid_nr == %d\n", nr);
> +
>  	get_pid_ns(ns);
>  	pid->level = ns->level;
>  	atomic_set(&pid->count, 1);
> 
> 
> 
> 
> 
> 
> 
> 
> On Thu, Mar 12, 2009 at 2:21 PM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
>> Quoting Greg Kurz (gkurz-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org):
>>> On Thu, 2009-03-12 at 09:53 -0500, Serge E. Hallyn wrote:
>>>> Or are you suggesting that you'll do a dummy clone of (5594,2) so that
>>>> the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?
>>>>
>>> Of course not
>> Ok - someone *did* argue that at some point I think...
>>
>>> but one should be able to tell clone() to pick a specific
>>> pid.
>> Can you explain exactly how?  I must be missing something clever.
>>
>> -serge
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo-Bw31MaZKKs0EbZ0PF+XxCw@public.gmane.org  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org"> email-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org </a>
[-- Attachment #2: clone_with_pid.patch --]
[-- Type: text/plain, Size: 7201 bytes --]
Subject: [RFC] forkpid() syscall
From: Cedric Le Goater <clg-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
Lets the user specify a pid to fork with and returns EBUSY if the pid is
not available.
This patch includes an alloc_pid*() cleanup of the way errors are
returned that could be pushed to mainline independently.
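For readers unfamiliar with the error-return style the cleanup moves to, here is a minimal userspace sketch of the kernel's pointer-encoded errno idiom (ERR_PTR/IS_ERR/PTR_ERR in include/linux/err.h). The sketch_ names and the sketch_wrap_result() helper are hypothetical and only imitate the idea; they are not the patch's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095 /* same bound the kernel uses for encoded errors */

/* Encode a negative errno value as a pointer. */
static void *sketch_err_ptr(long error)
{
	assert(error < 0 && error >= -SKETCH_MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an encoded pointer. */
static long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the top SKETCH_MAX_ERRNO addresses. */
static int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_pid {
	int nr;
};

/*
 * Hypothetical allocator: a negative nr_or_error is passed through as an
 * encoded error, so the caller learns *why* allocation failed instead of
 * only seeing NULL.
 */
static struct sketch_pid *sketch_wrap_result(int nr_or_error)
{
	struct sketch_pid *pid;

	if (nr_or_error < 0)
		return sketch_err_ptr(nr_or_error);
	pid = malloc(sizeof(*pid));
	if (!pid)
		return sketch_err_ptr(-ENOMEM);
	pid->nr = nr_or_error;
	return pid;
}
```

A caller would test the result with sketch_is_err() and propagate sketch_ptr_err() rather than assuming -ENOMEM for every failure.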
usage :
    #include <sys/syscall.h>
    #define __NR_forkpid 	324
    static inline int forkpid(int pid)
    {
	  return syscall(__NR_forkpid, pid);
    }
    
caveats : 
	fork oriented, should also cover clone
	i386 only
	does not cover 64-bit clone flags
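The allocation semantics described above (take exactly the requested pid or fail with EBUSY, otherwise scan forward from the last pid) can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the sketch_ names, the tiny SKETCH_PID_MAX and the byte-per-pid table are hypothetical, whereas the kernel uses test_and_set_bit() on bitmap pages. Rejecting an out-of-range request with -EINVAL and leaving last_pid untouched for explicit requests are choices made here, not taken from this patch.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define SKETCH_PID_MAX  64 /* tiny stand-in for pid_max */
#define SKETCH_RESERVED 4  /* stand-in for RESERVED_PIDS */

struct sketch_pidmap {
	unsigned char used[SKETCH_PID_MAX]; /* one flag per pid */
	int last_pid;
};

static void sketch_pidmap_init(struct sketch_pidmap *map)
{
	memset(map, 0, sizeof(*map));
	map->used[0] = 1; /* pid 0 is never handed out */
}

/*
 * upid == 0: return the next free pid after last_pid, wrapping to
 *            SKETCH_RESERVED; -EAGAIN if nothing is free.
 * upid != 0: return exactly upid, or -EBUSY if it is taken.
 */
static int sketch_alloc_pidmap(struct sketch_pidmap *map, int upid)
{
	int pid, scanned;

	if (upid < 0 || upid >= SKETCH_PID_MAX)
		return -EINVAL;

	if (upid) {
		if (map->used[upid])
			return -EBUSY;
		map->used[upid] = 1;
		return upid;
	}

	pid = map->last_pid + 1;
	for (scanned = 0; scanned < SKETCH_PID_MAX; scanned++, pid++) {
		if (pid >= SKETCH_PID_MAX)
			pid = SKETCH_RESERVED;
		if (!map->used[pid]) {
			map->used[pid] = 1;
			map->last_pid = pid;
			return pid;
		}
	}
	return -EAGAIN;
}

static void sketch_free_pidmap(struct sketch_pidmap *map, int pid)
{
	assert(pid > 0 && pid < SKETCH_PID_MAX);
	map->used[pid] = 0;
}
```

A restart tool would call the allocator with the checkpointed pid and treat -EBUSY as "this pid is already in use in the target namespace".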
Signed-off-by: Cedric Le Goater <clg-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
---
 arch/i386/kernel/process.c       |   15 +++++++++++----
 arch/i386/kernel/syscall_table.S |    1 +
 include/asm-i386/unistd.h        |    3 ++-
 include/linux/pid.h              |    2 +-
 include/linux/sched.h            |    2 +-
 kernel/fork.c                    |    9 +++++----
 kernel/pid.c                     |   28 +++++++++++++++-------------
 7 files changed, 36 insertions(+), 24 deletions(-)
Index: 2.6.22/kernel/pid.c
===================================================================
--- 2.6.22.orig/kernel/pid.c
+++ 2.6.22/kernel/pid.c
@@ -96,12 +96,12 @@ static fastcall void free_pidmap(struct 
 	atomic_inc(&map->nr_free);
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t upid)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
 	struct pidmap *map;
 
-	pid = last + 1;
+	pid = upid ? upid : last + 1;
 	if (pid >= pid_max)
 		pid = RESERVED_PIDS;
 	offset = pid & BITS_PER_PAGE_MASK;
@@ -130,6 +130,8 @@ static int alloc_pidmap(struct pid_names
 					pid_ns->last_pid = pid;
 					return pid;
 				}
+				if (upid)
+					return -EBUSY;
 				offset = find_next_offset(map, offset);
 				pid = mk_pid(pid_ns, map, offset);
 			/*
@@ -153,7 +155,7 @@ static int alloc_pidmap(struct pid_names
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return -EAGAIN;
 }
 
 static int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -203,19 +205,24 @@ fastcall void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(void)
+struct pid *alloc_pid(pid_t upid)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int nr = -1;
 
 	pid = kmem_cache_alloc(pid_cachep, GFP_KERNEL);
-	if (!pid)
+	if (!pid) {
+		pid = ERR_PTR(-ENOMEM);
 		goto out;
+	}
 
-	nr = alloc_pidmap(current->nsproxy->pid_ns);
-	if (nr < 0)
-		goto out_free;
+	nr = alloc_pidmap(current->nsproxy->pid_ns, upid);
+	if (nr < 0) {
+		kmem_cache_free(pid_cachep, pid);
+		pid = ERR_PTR(nr);
+		goto out;
+	}
 
 	atomic_set(&pid->count, 1);
 	pid->nr = nr;
@@ -228,11 +235,6 @@ struct pid *alloc_pid(void)
 
 out:
 	return pid;
-
-out_free:
-	kmem_cache_free(pid_cachep, pid);
-	pid = NULL;
-	goto out;
 }
 
 struct pid * fastcall find_pid(int nr)
Index: 2.6.22/arch/i386/kernel/process.c
===================================================================
--- 2.6.22.orig/arch/i386/kernel/process.c
+++ 2.6.22/arch/i386/kernel/process.c
@@ -355,7 +355,7 @@ int kernel_thread(int (*fn)(void *), voi
 	regs.eflags = X86_EFLAGS_IF | X86_EFLAGS_SF | X86_EFLAGS_PF | 0x2;
 
 	/* Ok, create the new process.. */
-	return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
+	return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL, 0);
 }
 EXPORT_SYMBOL(kernel_thread);
 
@@ -722,9 +722,16 @@ struct task_struct fastcall * __switch_t
 	return prev_p;
 }
 
+asmlinkage int sys_forkpid(struct pt_regs regs)
+{
+	pid_t pid = regs.ebx;
+
+	return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL, pid);
+}
+
 asmlinkage int sys_fork(struct pt_regs regs)
 {
-	return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
+	return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL, 0);
 }
 
 asmlinkage int sys_clone(struct pt_regs regs)
@@ -739,7 +746,7 @@ asmlinkage int sys_clone(struct pt_regs 
 	child_tidptr = (int __user *)regs.edi;
 	if (!newsp)
 		newsp = regs.esp;
-	return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
+	return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr, 0);
 }
 
 /*
@@ -754,7 +761,7 @@ asmlinkage int sys_clone(struct pt_regs 
  */
 asmlinkage int sys_vfork(struct pt_regs regs)
 {
-	return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
+	return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL, 0);
 }
 
 /*
Index: 2.6.22/arch/i386/kernel/syscall_table.S
===================================================================
--- 2.6.22.orig/arch/i386/kernel/syscall_table.S
+++ 2.6.22/arch/i386/kernel/syscall_table.S
@@ -323,3 +323,4 @@ ENTRY(sys_call_table)
 	.long sys_signalfd
 	.long sys_timerfd
 	.long sys_eventfd
+	.long sys_forkpid
Index: 2.6.22/include/asm-i386/unistd.h
===================================================================
--- 2.6.22.orig/include/asm-i386/unistd.h
+++ 2.6.22/include/asm-i386/unistd.h
@@ -329,10 +329,11 @@
 #define __NR_signalfd		321
 #define __NR_timerfd		322
 #define __NR_eventfd		323
+#define __NR_forkpid		324
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 324
+#define NR_syscalls 325
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: 2.6.22/kernel/fork.c
===================================================================
--- 2.6.22.orig/kernel/fork.c
+++ 2.6.22/kernel/fork.c
@@ -1358,15 +1358,16 @@ long do_fork(unsigned long clone_flags,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      pid_t upid)
 {
 	struct task_struct *p;
 	int trace = 0;
-	struct pid *pid = alloc_pid();
+	struct pid *pid = alloc_pid(upid);
 	long nr;
 
-	if (!pid)
-		return -EAGAIN;
+	if (IS_ERR(pid))
+		return PTR_ERR(pid);
 	nr = pid->nr;
 	if (unlikely(current->ptrace)) {
 		trace = fork_traceflag (clone_flags);
Index: 2.6.22/include/linux/sched.h
===================================================================
--- 2.6.22.orig/include/linux/sched.h
+++ 2.6.22/include/linux/sched.h
@@ -1433,7 +1433,7 @@ extern int allow_signal(int);
 extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
-extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, pid_t);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
Index: 2.6.22/include/linux/pid.h
===================================================================
--- 2.6.22.orig/include/linux/pid.h
+++ 2.6.22/include/linux/pid.h
@@ -95,7 +95,7 @@ extern struct pid *FASTCALL(find_pid(int
 extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr);
 
-extern struct pid *alloc_pid(void);
+extern struct pid *alloc_pid(pid_t upid);
 extern void FASTCALL(free_pid(struct pid *pid));
 
 static inline pid_t pid_nr(struct pid *pid)
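The allocation rule the patch adds to alloc_pidmap() can be modelled in userspace. What follows is only a minimal sketch under stated assumptions: the names toy_pidmap, toy_alloc_pidmap, TOY_PID_MAX and TOY_RESERVED are hypothetical, a flat array stands in for the kernel's bitmap pages, and an out-of-range request is rejected with -EINVAL instead of wrapping to the reserved range as the patch does. Like the patch, a requested pid that is already taken yields -EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy limits; the kernel's pid_max and RESERVED_PIDS differ. */
#define TOY_PID_MAX   64
#define TOY_RESERVED  4

struct toy_pidmap {
	bool used[TOY_PID_MAX];	/* one flag per pid, stands in for the bitmap */
	int last_pid;		/* last pid handed out by the "don't care" path */
};

/*
 * upid == 0: hand out the next free pid after last_pid, wrapping around
 *            to TOY_RESERVED once the top of the range is reached.
 * upid != 0: hand out exactly that pid, or fail with -EBUSY if it is taken.
 */
static int toy_alloc_pidmap(struct toy_pidmap *map, int upid)
{
	int i, pid;

	assert(map != NULL);

	if (upid) {
		if (upid < 0 || upid >= TOY_PID_MAX)
			return -EINVAL;	/* assumption: reject instead of wrapping */
		if (map->used[upid])
			return -EBUSY;
		map->used[upid] = true;
		return upid;
	}

	pid = map->last_pid + 1;
	for (i = 0; i < TOY_PID_MAX; i++, pid++) {
		if (pid >= TOY_PID_MAX)
			pid = TOY_RESERVED;
		if (!map->used[pid]) {
			map->used[pid] = true;
			map->last_pid = pid;
			return pid;
		}
	}
	return -EAGAIN;
}
```

A restart tool would call this with the pid recorded in the checkpoint image and treat -EBUSY as "this pid is still in use here", while ordinary fork paths keep passing 0.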
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-12 14:53                       ` Serge E. Hallyn
  2009-03-12 21:01                         ` Greg Kurz
@ 2009-03-13 15:47                         ` Cedric Le Goater
  2009-03-13 16:35                           ` Serge E. Hallyn
  1 sibling, 1 reply; 121+ messages in thread
From: Cedric Le Goater @ 2009-03-13 15:47 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Alexey Dobriyan, linux-api, containers, hpa, linux-kernel,
	Dave Hansen, linux-mm, viro, mingo, mpm, tglx, torvalds,
	Andrew Morton, xemul
> No, what you're suggesting does not suffice.
probably. I'm still trying to understand what you mean below :)
Man, I hate these hierarchical pid_ns. one level would have been enough,
just one vpid attribute in 'struct pid*'
 
> Call
> (5591,3,1) the task known as 5591 in the init_pid_ns, 3 in a child pid
> ns, and 1 in grandchild pid_ns created from there.  Now assume we are
> checkpointing tasks T1=(5592,1), and T2=(5594,3,1).
> 
> We don't care about the first number in the tuples, so they will be
> random numbers after the recreate. 
yes.
> But we do care about the second numbers.  
yes, very much, and we need a way to set these numbers in alloc_pid()
> But specifying CLONE_NEWPID while recreating the process tree
> in userspace does not allow you to specify the 3 in (5594,3,1).
I haven't looked closely at hierarchical pid namespaces, but as we're
using an array of pids indexed by the pidns level, i don't see why
it shouldn't be possible. you might be right.
anyway, I think that some CLONE_NEW* should be forbidden. Daniel should
soon send a little patch for the ns_cgroup restricting the clone flags
that can be used in a container.
Cheers,
C.
> Or are you suggesting that you'll do a dummy clone of (5594,2) so that
> the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?
> 
> -serge
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13 15:47                         ` Cedric Le Goater
@ 2009-03-13 16:35                           ` Serge E. Hallyn
  2009-03-13 16:53                             ` Cedric Le Goater
  0 siblings, 1 reply; 121+ messages in thread
From: Serge E. Hallyn @ 2009-03-13 16:35 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: Alexey Dobriyan, linux-api, containers, hpa, linux-kernel,
	Dave Hansen, linux-mm, viro, mingo, mpm, tglx, torvalds,
	Andrew Morton, xemul
Quoting Cedric Le Goater (legoater@free.fr):
> 
> > No, what you're suggesting does not suffice.
> 
> probably. I'm still trying to understand what you mean below :)
> 
> Man, I hate these hierarchical pid_ns. one level would have been enough,
> just one vpid attribute in 'struct pid*'
Well I don't mind - temporarily - saying that nested pid namespaces
are not checkpointable.  It's just that if we're going to need a new
syscall anyway, then why not go ahead and address the whole problem?
It's not hugely more complicated, and seems worth it.
> > Call
> > (5591,3,1) the task known as 5591 in the init_pid_ns, 3 in a child pid
> > ns, and 1 in grandchild pid_ns created from there.  Now assume we are
> > checkpointing tasks T1=(5592,1), and T2=(5594,3,1).
> > 
> > We don't care about the first number in the tuples, so they will be
> > random numbers after the recreate. 
> 
> yes.
> 
> > But we do care about the second numbers.  
> 
> yes, very much, and we need a way to set these numbers in alloc_pid()
> 
> > But specifying CLONE_NEWPID while recreating the process tree
> > in userspace does not allow you to specify the 3 in (5594,3,1).
> 
> I haven't looked closely at hierarchical pid namespaces, but as we're
> using an array of pids indexed by the pidns level, i don't see why
> it shouldn't be possible. you might be right.
> 
> anyway, I think that some CLONE_NEW* should be forbidden. Daniel should
> soon send a little patch for the ns_cgroup restricting the clone flags
> that can be used in a container.
Uh, that feels a bit over the top.  We want to make this
uncheckpointable (if it remains so), not prevent the whole action.
After all I may be running a container which I don't plan on ever
checkpointing, and inside that container running a job which i do
want to migrate.
So depending on if we're doing the Dave or the rest-of-the-world
way :), we either clear_bit(pidns->may_checkpoint) on the parent
pid_ns when a child is created, or we walk every task being
checkpointed and make sure they each are in the same pid_ns.  Doesn't
that suffice?
-serge
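The two alternatives above (clearing a may_checkpoint flag on the parent when a child pid_ns is created, or walking the tasks at checkpoint time) can be sketched side by side. This is a hedged userspace model, not kernel code: toy_pid_ns, toy_task, toy_create_child_ns and toy_same_pid_ns are hypothetical names, and a plain bool stands in for whatever bit the kernel would actually use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct pid_namespace and struct task_struct. */
struct toy_pid_ns {
	struct toy_pid_ns *parent;
	bool may_checkpoint;
};

struct toy_task {
	struct toy_pid_ns *pid_ns;
};

/* Option 1: creating a child namespace marks the parent uncheckpointable. */
static void toy_create_child_ns(struct toy_pid_ns *parent,
				struct toy_pid_ns *child)
{
	assert(parent != NULL && child != NULL);
	child->parent = parent;
	child->may_checkpoint = true;
	parent->may_checkpoint = false;
}

/* Option 2: at checkpoint time, verify that all tasks share one namespace. */
static bool toy_same_pid_ns(const struct toy_task *tasks, size_t n)
{
	size_t i;

	for (i = 1; i < n; i++)
		if (tasks[i].pid_ns != tasks[0].pid_ns)
			return false;
	return true;
}
```

Option 1 pays a small cost at namespace creation and makes the answer a single flag test; option 2 keeps creation untouched and pays a linear walk at checkpoint time.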
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13 16:35                           ` Serge E. Hallyn
@ 2009-03-13 16:53                             ` Cedric Le Goater
  0 siblings, 0 replies; 121+ messages in thread
From: Cedric Le Goater @ 2009-03-13 16:53 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Alexey Dobriyan, linux-api, containers, hpa, linux-kernel,
	Dave Hansen, linux-mm, viro, mingo, mpm, tglx, torvalds,
	Andrew Morton, xemul
Serge E. Hallyn wrote:
> Quoting Cedric Le Goater (legoater@free.fr):
>>> No, what you're suggesting does not suffice.
>> probably. I'm still trying to understand what you mean below :)
>>
>> Man, I hate these hierarchical pid_ns. one level would have been enough,
>> just one vpid attribute in 'struct pid*'
> 
> Well I don't mind - temporarily - saying that nested pid namespaces
> are not checkpointable.  It's just that if we're going to need a new
> syscall anyway, then why not go ahead and address the whole problem?
> It's not hugely more complicated, and seems worth it.
yes. agree. there's a thread going on that topic. i'm following it.
[ ... ] 
>> anyway, I think that some CLONE_NEW* should be forbidden. Daniel should
>> soon send a little patch for the ns_cgroup restricting the clone flags
>> that can be used in a container.
> 
> Uh, that feels a bit over the top.  We want to make this
> uncheckpointable (if it remains so), not prevent the whole action.
> After all I may be running a container which I don't plan on ever
> checkpointing, and inside that container running a job which i do
> want to migrate.
ok. i've been scanning the emails a bit fast. that would be fine 
and useful.
> So depending on if we're doing the Dave or the rest-of-the-world
> way :), we either clear_bit(pidns->may_checkpoint) on the parent
> pid_ns when a child is created, or we walk every task being
> checkpointed and make sure they each are in the same pid_ns.  
> Doesn't that suffice?
yes. this 'may_checkpoint' is container-level info, so I wonder
where you store it. in a cgroup_checkpoint? sorry for jumping in
and maybe restarting some old topics of discussion.
C.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                                   ` <49BA7B60.60607-GANU6spQydw@public.gmane.org>
@ 2009-03-13 17:11                                     ` Greg Kurz
  0 siblings, 0 replies; 121+ messages in thread
From: Greg Kurz @ 2009-03-13 17:11 UTC (permalink / raw)
  To: Cedric Le Goater
  Cc: mingo-X9Un+BFzKDI, linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	Andrew Morton, xemul-GEFAQzZX7r8dnm+yROfE0A,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, Alexey Dobriyan
On Fri, 2009-03-13 at 16:27 +0100, Cedric Le Goater wrote:
> Ying Han wrote:
> > Hi Serge:
> > I made a patch based on Oren's tree recently which implements a new
> > syscall, clone_with_pid. I tested it with a checkpoint/restart process tree
> > and it works as expected.
> > This patch has a hack in it: I made a copy of libc's clone and
> > modified it to pass one more argument (the pid number). I will
> > try to clean up the code and do more testing.
> 
> ok. 2 patches would also be interesting. one for the syscall and one
> for the kernel support (which might be acceptable)
> 
> > New syscall clone_with_pid
> > Implement a new syscall which clones a thread with a preselected pid number.
> 
> yes, this is definitely needed to restart a task/thread. we maintain an ugly
> hack which stores a pid in the current task that will be used by the next
> clone() call.
> 
That's probably better, as you say... but damn, sys_clone() is arch
dependent, so there are many files to patch. :)
> > clone_with_pid(child_func, child_stack + CHILD_STACK - 16,
> > 			CLONE_WITH_PID|SIGCHLD, pid, NULL);
> 
> since you're introducing a new syscall, I don't see why you need a 
> CLONE_WITH_PID flag ?
> 
> (FYI, attached is my old attempt of clone_with_pid but that's too old)
> 
> [ ... ]
> 
> > +#define DEBUG
> >  #include <linux/slab.h>
> >  #include <linux/init.h>
> >  #include <linux/unistd.h>
> > @@ -959,10 +959,19 @@ static struct task_struct *copy_process(unsigned long cl
> >  	int retval;
> >  	struct task_struct *p;
> >  	int cgroup_callbacks_done = 0;
> > +	pid_t clone_pid = stack_size;
> > 
> >  	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
> >  		return ERR_PTR(-EINVAL);
> > 
> > +	/* We only allow the clone_with_pid when a new pid namespace is
> > +	 * created. FIXME: how to restrict it.
> > +	 */
> > +	if ((clone_flags & CLONE_NEWPID) && (clone_flags & CLONE_WITH_PID))
> > +		return ERR_PTR(-EINVAL);
> > +	if ((clone_flags & CLONE_WITH_PID) && (clone_pid <= 1))
> > +		return ERR_PTR(-EINVAL);
> 
> I would let alloc_pid() handle the error.
> 
> >  	/*
> >  	 * Thread groups must share signals as well, and detached threads
> >  	 * can only be started up within the thread group.
> > @@ -1135,7 +1144,10 @@ static struct task_struct *copy_process(unsigned long c
> > 
> >  	if (pid != &init_struct_pid) {
> >  		retval = -ENOMEM;
> > -		pid = alloc_pid(task_active_pid_ns(p));
> > +		if (clone_flags & CLONE_WITH_PID)
> > +			pid = alloc_pid(task_active_pid_ns(p), clone_pid);
> > +		else
> > +			pid = alloc_pid(task_active_pid_ns(p), 0);
> 
> this is overkill IMO.
> 
> [ ... ]
> 
> > -static int alloc_pidmap(struct pid_namespace *pid_ns)
> > +static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t pid_nr)
> >  {
> >  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
> >  	struct pidmap *map;
> > 
> > -	pid = last + 1;
> > +	if (pid_nr)
> > +		pid = pid_nr;
> > +	else
> > +		pid = last + 1;
> >
> >  	if (pid >= pid_max)
> >  		pid = RESERVED_PIDS;
> >  	offset = pid & BITS_PER_PAGE_MASK;
> > @@ -153,9 +156,12 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
> >  			do {
> >  				if (!test_and_set_bit(offset, map->page)) {
> >  					atomic_dec(&map->nr_free);
> > -					pid_ns->last_pid = pid;
> > +					if (!pid_nr)
> > +						pid_ns->last_pid = pid;
> >  					return pid;
> >  				}
> > +				if (pid_nr)
> > +					return -1;
> >  				offset = find_next_offset(map, offset);
> >  				pid = mk_pid(pid_ns, map, offset);
> >  			/*
> > @@ -239,21 +245,25 @@ void free_pid(struct pid *pid)
> >  	call_rcu(&pid->rcu, delayed_put_pid);
> >  }
> > 
> > -struct pid *alloc_pid(struct pid_namespace *ns)
> > +struct pid *alloc_pid(struct pid_namespace *ns, pid_t pid_nr)
> >  {
> >  	struct pid *pid;
> >  	enum pid_type type;
> >  	int i, nr;
> >  	struct pid_namespace *tmp;
> >  	struct upid *upid;
> > +	int level = ns->level;
> > +
> > +	if (pid_nr >= pid_max)
> > +		return NULL;
> 
> let alloc_pidmap() handle it ? 
> 
> > 
> >  	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
> >  	if (!pid)
> >  		goto out;
> > 
> > -	tmp = ns;
> > -	for (i = ns->level; i >= 0; i--) {
> > -		nr = alloc_pidmap(tmp);
> > +	tmp = ns->parent;
> > +	for (i = level-1; i >= 0; i--) {
> > +		nr = alloc_pidmap(tmp, 0);
> >  		if (nr < 0)
> >  			goto out_free;
> > 
> > @@ -262,6 +272,14 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> >  		tmp = tmp->parent;
> >  	}
> > 
> > +	nr = alloc_pidmap(ns, pid_nr);
> > +	if (nr < 0)
> > +		goto out_free;
> > +	pid->numbers[level].nr = nr;
> > +	pid->numbers[level].ns = ns;
> > +	if (nr == pid_nr)
> > +		pr_debug("nr == pid_nr == %d\n", nr);
> > +
> >  	get_pid_ns(ns);
> >  	pid->level = ns->level;
> >  	atomic_set(&pid->count, 1);
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > On Thu, Mar 12, 2009 at 2:21 PM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
> >> Quoting Greg Kurz (gkurz-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org):
> >>> On Thu, 2009-03-12 at 09:53 -0500, Serge E. Hallyn wrote:
> >>>> Or are you suggesting that you'll do a dummy clone of (5594,2) so that
> >>>> the next clone(CLONE_NEWPID) will be expected to be (5594,3,1)?
> >>>>
> >>> Of course not
> >> Ok - someone *did* argue that at some point I think...
> >>
> >>> but one should be able to tell clone() to pick a specific
> >>> pid.
> >> Can you explain exactly how?  I must be missing something clever.
> >>
> >> -serge
> >>
> >> --
> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> the body to majordomo-Bw31MaZKKs0EbZ0PF+XxCw@public.gmane.org  For more info on Linux MM,
> >> see: http://www.linux-mm.org/ .
> >> Don't email: <a href=mailto:"dont-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org"> email-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org </a>
> 
> plain text document attachment (clone_with_pid.patch)
> Subject: [RFC] forkpid() syscall
> 
> From: Cedric Le Goater <clg-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
> 
> lets the user specify a pid to fork and returns EBUSY if the pid is
> not available.
> 
> this patch includes an alloc_pid*() cleanup of the way errors are
> returned that could be pushed to mainline independently.
> 
> usage :
> 
>     #include <sys/syscall.h>
> 
>     #define __NR_forkpid 	324
> 
>     static inline int forkpid(int pid)
>     {
> 	  return syscall(__NR_forkpid, pid);
>     }
>     
> caveats : 
> 	fork oriented, should also cover clone
> 	i386 only
> 	does not cover 64 bits clone flags
> 
> 
> Signed-off-by: Cedric Le Goater <clg-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
> ---
>  arch/i386/kernel/process.c       |   15 +++++++++++----
>  arch/i386/kernel/syscall_table.S |    1 +
>  include/asm-i386/unistd.h        |    3 ++-
>  include/linux/pid.h              |    2 +-
>  include/linux/sched.h            |    2 +-
>  kernel/fork.c                    |    9 +++++----
>  kernel/pid.c                     |   28 +++++++++++++++-------------
>  7 files changed, 36 insertions(+), 24 deletions(-)
> 
> Index: 2.6.22/kernel/pid.c
> ===================================================================
> --- 2.6.22.orig/kernel/pid.c
> +++ 2.6.22/kernel/pid.c
> @@ -96,12 +96,12 @@ static fastcall void free_pidmap(struct 
>  	atomic_inc(&map->nr_free);
>  }
> 
> -static int alloc_pidmap(struct pid_namespace *pid_ns)
> +static int alloc_pidmap(struct pid_namespace *pid_ns, pid_t upid)
>  {
>  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
>  	struct pidmap *map;
> 
> -	pid = last + 1;
> +	pid = upid ? upid : last + 1;
>  	if (pid >= pid_max)
>  		pid = RESERVED_PIDS;
>  	offset = pid & BITS_PER_PAGE_MASK;
> @@ -130,6 +130,8 @@ static int alloc_pidmap(struct pid_names
>  					pid_ns->last_pid = pid;
>  					return pid;
>  				}
> +				if (upid)
> +					return -EBUSY;
>  				offset = find_next_offset(map, offset);
>  				pid = mk_pid(pid_ns, map, offset);
>  			/*
> @@ -153,7 +155,7 @@ static int alloc_pidmap(struct pid_names
>  		}
>  		pid = mk_pid(pid_ns, map, offset);
>  	}
> -	return -1;
> +	return -EAGAIN;
>  }
> 
>  static int next_pidmap(struct pid_namespace *pid_ns, int last)
> @@ -203,19 +205,24 @@ fastcall void free_pid(struct pid *pid)
>  	call_rcu(&pid->rcu, delayed_put_pid);
>  }
> 
> -struct pid *alloc_pid(void)
> +struct pid *alloc_pid(pid_t upid)
>  {
>  	struct pid *pid;
>  	enum pid_type type;
>  	int nr = -1;
> 
>  	pid = kmem_cache_alloc(pid_cachep, GFP_KERNEL);
> -	if (!pid)
> +	if (!pid) {
> +		pid = ERR_PTR(-ENOMEM);
>  		goto out;
> +	}
> 
> -	nr = alloc_pidmap(current->nsproxy->pid_ns);
> -	if (nr < 0)
> -		goto out_free;
> +	nr = alloc_pidmap(current->nsproxy->pid_ns, upid);
> +	if (nr < 0) {
> +		kmem_cache_free(pid_cachep, pid);
> +		pid = ERR_PTR(nr);
> +		goto out;
> +	}
> 
>  	atomic_set(&pid->count, 1);
>  	pid->nr = nr;
> @@ -228,11 +235,6 @@ struct pid *alloc_pid(void)
> 
>  out:
>  	return pid;
> -
> -out_free:
> -	kmem_cache_free(pid_cachep, pid);
> -	pid = NULL;
> -	goto out;
>  }
> 
>  struct pid * fastcall find_pid(int nr)
> Index: 2.6.22/arch/i386/kernel/process.c
> ===================================================================
> --- 2.6.22.orig/arch/i386/kernel/process.c
> +++ 2.6.22/arch/i386/kernel/process.c
> @@ -355,7 +355,7 @@ int kernel_thread(int (*fn)(void *), voi
>  	regs.eflags = X86_EFLAGS_IF | X86_EFLAGS_SF | X86_EFLAGS_PF | 0x2;
> 
>  	/* Ok, create the new process.. */
> -	return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL);
> +	return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, &regs, 0, NULL, NULL, 0);
>  }
>  EXPORT_SYMBOL(kernel_thread);
> 
> @@ -722,9 +722,16 @@ struct task_struct fastcall * __switch_t
>  	return prev_p;
>  }
> 
> +asmlinkage int sys_forkpid(struct pt_regs regs)
> +{
> +	pid_t pid = regs.ebx;
> +
> +	return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL, pid);
> +}
> +
>  asmlinkage int sys_fork(struct pt_regs regs)
>  {
> -	return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
> +	return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL, 0);
>  }
> 
>  asmlinkage int sys_clone(struct pt_regs regs)
> @@ -739,7 +746,7 @@ asmlinkage int sys_clone(struct pt_regs 
>  	child_tidptr = (int __user *)regs.edi;
>  	if (!newsp)
>  		newsp = regs.esp;
> -	return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
> +	return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr, 0);
>  }
> 
>  /*
> @@ -754,7 +761,7 @@ asmlinkage int sys_clone(struct pt_regs 
>   */
>  asmlinkage int sys_vfork(struct pt_regs regs)
>  {
> -	return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
> +	return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL, 0);
>  }
> 
>  /*
> Index: 2.6.22/arch/i386/kernel/syscall_table.S
> ===================================================================
> --- 2.6.22.orig/arch/i386/kernel/syscall_table.S
> +++ 2.6.22/arch/i386/kernel/syscall_table.S
> @@ -323,3 +323,4 @@ ENTRY(sys_call_table)
>  	.long sys_signalfd
>  	.long sys_timerfd
>  	.long sys_eventfd
> +	.long sys_forkpid
> Index: 2.6.22/include/asm-i386/unistd.h
> ===================================================================
> --- 2.6.22.orig/include/asm-i386/unistd.h
> +++ 2.6.22/include/asm-i386/unistd.h
> @@ -329,10 +329,11 @@
>  #define __NR_signalfd		321
>  #define __NR_timerfd		322
>  #define __NR_eventfd		323
> +#define __NR_forkpid		324
> 
>  #ifdef __KERNEL__
> 
> -#define NR_syscalls 324
> +#define NR_syscalls 325
> 
>  #define __ARCH_WANT_IPC_PARSE_VERSION
>  #define __ARCH_WANT_OLD_READDIR
> Index: 2.6.22/kernel/fork.c
> ===================================================================
> --- 2.6.22.orig/kernel/fork.c
> +++ 2.6.22/kernel/fork.c
> @@ -1358,15 +1358,16 @@ long do_fork(unsigned long clone_flags,
>  	      struct pt_regs *regs,
>  	      unsigned long stack_size,
>  	      int __user *parent_tidptr,
> -	      int __user *child_tidptr)
> +	      int __user *child_tidptr,
> +	      pid_t upid)
>  {
>  	struct task_struct *p;
>  	int trace = 0;
> -	struct pid *pid = alloc_pid();
> +	struct pid *pid = alloc_pid(upid);
>  	long nr;
> 
> -	if (!pid)
> -		return -EAGAIN;
> +	if (IS_ERR(pid))
> +		return PTR_ERR(pid);
>  	nr = pid->nr;
>  	if (unlikely(current->ptrace)) {
>  		trace = fork_traceflag (clone_flags);
> Index: 2.6.22/include/linux/sched.h
> ===================================================================
> --- 2.6.22.orig/include/linux/sched.h
> +++ 2.6.22/include/linux/sched.h
> @@ -1433,7 +1433,7 @@ extern int allow_signal(int);
>  extern int disallow_signal(int);
> 
>  extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
> -extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
> +extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *, pid_t);
>  struct task_struct *fork_idle(int);
> 
>  extern void set_task_comm(struct task_struct *tsk, char *from);
> Index: 2.6.22/include/linux/pid.h
> ===================================================================
> --- 2.6.22.orig/include/linux/pid.h
> +++ 2.6.22/include/linux/pid.h
> @@ -95,7 +95,7 @@ extern struct pid *FASTCALL(find_pid(int
>  extern struct pid *find_get_pid(int nr);
>  extern struct pid *find_ge_pid(int nr);
> 
> -extern struct pid *alloc_pid(void);
> +extern struct pid *alloc_pid(pid_t upid);
>  extern void FASTCALL(free_pid(struct pid *pid));
> 
>  static inline pid_t pid_nr(struct pid *pid)
-- 
Gregory Kurz                                     gkurz-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org
Software Engineer @ IBM/Meiosys                  http://www.ibm.com
Tel +33 (0)534 638 479                           Fax +33 (0)561 400 420
"Anarchy is about taking complete responsibility for yourself."
        Alan Moore.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13  5:34                               ` Sukadev Bhattiprolu
       [not found]                                 ` <20090313053458.GA28833-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-03-13 17:27                                 ` Linus Torvalds
  2009-03-13 19:02                                   ` Serge E. Hallyn
                                                     ` (2 more replies)
  1 sibling, 3 replies; 121+ messages in thread
From: Linus Torvalds @ 2009-03-13 17:27 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Ying Han, Serge E. Hallyn, linux-api, containers, hpa,
	linux-kernel, Dave Hansen, linux-mm, viro, mingo, mpm,
	Andrew Morton, xemul, tglx, Alexey Dobriyan
On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
> Ying Han [yinghan@google.com] wrote:
> | Hi Serge:
> | I made a patch based on Oren's tree recently which implements a new
> | syscall, clone_with_pid. I tested it with a checkpoint/restart process tree
> | and it works as expected.
> 
> Yes, I think we had a version of clone() with pid a while ago.
Are people _at_all_ thinking about security?
Obviously not.
There's no way we can do anything like this. Sure, it's trivial to do 
inside the kernel. But it also sounds like a _wonderful_ attack vector 
against badly written user-land software that sends signals and has small 
races.
Quite frankly, from having followed the discussion(s) over the last few 
weeks about checkpoint/restart in various forms, my reaction to just about 
_all_ of this is that people pushing this are pretty damn borderline. 
I think you guys are working on all the wrong problems. 
Let's face it, we're not going to _ever_ checkpoint any kind of general 
case process. Just TCP makes that fundamentally impossible in the general 
case, and there are lots and lots of other cases too (just something as 
totally _trivial_ as all the files in the filesystem that don't get rolled 
back).
So unless people start realizing that
 (a) processes that want to be checkpointed had better be ready and aware 
     of it, and help out
 (b) there's no way in hell that we're going to add these kinds of 
     interfaces that have dubious upsides (just teach the damn program 
     you're checkpointing that pids will change, and admit to everybody 
     that people who want to be checkpointed need to do work) and are 
     potential security holes.
 (c) if you are going to play any deeper games, you need to have 
     privileges. IOW, "clone_with_pid()" is ok for _root_, but not for 
     some random user. And you'd better keep that in mind EVERY SINGLE 
     STEP OF THE WAY.
I'm really fed up with these discussions. I have seen almost _zero_ 
critical thinking at all. Probably because anybody who is in the least 
doubtful about it simply has tuned out the discussion. So here's my input: 
start small, start over, and start thinking about other issues than just 
checkpointing.
		Linus
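Point (c) above amounts to a gate in front of any pid-selecting interface. The following is a minimal sketch of that policy only, assuming a hypothetical helper toy_check_requested_pid and a boolean "privileged" argument; in the kernel the decision would come from a capability test on the caller rather than from a flag passed in.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical policy check: returns 0 if the request may proceed,
 * -EPERM if an unprivileged caller tries to choose its own pid.
 * requested_pid == 0 means "let the kernel pick", as in plain fork/clone.
 */
static int toy_check_requested_pid(bool privileged, int requested_pid)
{
	assert(requested_pid >= 0);

	if (requested_pid == 0)
		return 0;	/* ordinary path: nothing to restrict */
	if (!privileged)
		return -EPERM;	/* choosing a pid is a privileged operation */
	return 0;
}
```

Running the check before any pid is allocated keeps the unprivileged path identical to today's fork, so the new interface adds no attack surface for ordinary users.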
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13  4:29                             ` Ying Han
  2009-03-13  5:34                               ` Sukadev Bhattiprolu
       [not found]                               ` <604427e00903122129y37ad791aq5fe7ef2552415da9-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-03-13 17:37                               ` Serge E. Hallyn
  2 siblings, 0 replies; 121+ messages in thread
From: Serge E. Hallyn @ 2009-03-13 17:37 UTC (permalink / raw)
  To: Ying Han
  Cc: Greg Kurz, Cedric Le Goater, Andrew Morton, linux-api, containers,
	mpm, linux-kernel, Dave Hansen, linux-mm, tglx, viro, hpa, mingo,
	torvalds, Alexey Dobriyan, xemul
Quoting Ying Han (yinghan@google.com):
> Hi Serge:
> I made a patch based on Oren's tree recently which implements a new
> syscall, clone_with_pid. I tested it with a checkpoint/restart process tree
> and it works as expected.
> This patch has a hack in it: I made a copy of libc's clone and
> modified it to pass one more argument (the pid number). I will
> try to clean up the code and do more testing.
> 
> New syscall clone_with_pid
> Implement a new syscall which clones a thread with a preselected pid number.
> 
> clone_with_pid(child_func, child_stack + CHILD_STACK - 16,
> 			CLONE_WITH_PID|SIGCHLD, pid, NULL);
> 
> Signed-off-by: Ying Han <yinghan@google.com>
> 
> diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
> index 87803da..b5a1b03 100644
> --- a/arch/x86/include/asm/syscalls.h
> +++ b/arch/x86/include/asm/syscalls.h
> @@ -26,6 +26,7 @@ asmlinkage int sys_fork(struct pt_regs);
>  asmlinkage int sys_clone(struct pt_regs);
>  asmlinkage int sys_vfork(struct pt_regs);
>  asmlinkage int sys_execve(struct pt_regs);
> +asmlinkage int sys_clone_with_pid(struct pt_regs);
> 
>  /* kernel/signal_32.c */
>  asmlinkage int sys_sigsuspend(int, int, old_sigset_t);
> diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32
> index a5f9e09..f10ca0e 100644
> --- a/arch/x86/include/asm/unistd_32.h
> +++ b/arch/x86/include/asm/unistd_32.h
> @@ -340,6 +340,7 @@
>  #define __NR_inotify_init1	332
>  #define __NR_checkpoint		333
>  #define __NR_restart		334
> +#define __NR_clone_with_pid	335
> 
>  #ifdef __KERNEL__
> 
> diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
> index 0a1302f..88ae634 100644
> --- a/arch/x86/kernel/process_32.c
> +++ b/arch/x86/kernel/process_32.c
> @@ -8,7 +8,6 @@
>  /*
>   * This file handles the architecture-dependent parts of process handling..
>   */
> -
>  #include <stdarg.h>
> 
>  #include <linux/cpu.h>
> @@ -652,6 +651,28 @@ asmlinkage int sys_clone(struct pt_regs regs)
>  	return do_fork(clone_flags, newsp, &regs, 0, parent_tidptr, child_tidptr);
>  }
> 
> +/**
> + * sys_clone_with_pid - clone a thread with pre-select pid number.
> + */
> +asmlinkage int sys_clone_with_pid(struct pt_regs regs)
> +{
> +	unsigned long clone_flags;
> +	unsigned long newsp;
> +	int __user *parent_tidptr, *child_tidptr;
> +	pid_t pid_nr;
> +
> +	clone_flags = regs.bx;
> +	newsp = regs.cx;
> +	parent_tidptr = (int __user *)regs.dx;
> +	child_tidptr = (int __user *)regs.di;
> +	pid_nr = regs.bp;
Hi,
Thanks for looking at this.  I appreciate the patch.  Two comments
however.
As I was saying in another email, I think that so long as we are going
with a new syscall, we should make sure that it suffices for nested
pid namespaces.  So send in an array of pids and its length, then
use an algorithm like Alexey's to fill in the sent-in pids if possible.
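As a rough illustration of the "array of pids and its length" idea, here
is a user-space toy model, assuming one requested pid per nesting level.
It is not Alexey's algorithm and not kernel code; every name in it
(toy_pid_ns, toy_reserve_pids, the TOY_* limits) is hypothetical.  It only
shows the all-or-nothing behaviour one would probably want: either every
requested pid is free at its level, or nothing is reserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PID_MAX    64 /* hypothetical per-namespace pid limit */
#define TOY_MAX_LEVELS 4  /* hypothetical maximum nesting depth */

/* Toy model of one pid namespace level: which pid numbers are taken. */
struct toy_pid_ns {
	bool used[TOY_PID_MAX];
};

/*
 * Try to reserve pids[i] in levels[i] for every nesting level.
 * Either all requested pids are reserved, or no state is changed.
 * Returns 0 on success, -1 if a request is invalid or already taken.
 */
int toy_reserve_pids(struct toy_pid_ns *levels, const int *pids, size_t nr)
{
	size_t i;

	assert(levels != NULL && pids != NULL);

	if (nr == 0 || nr > TOY_MAX_LEVELS)
		return -1;

	/* First pass: validate everything before touching any state. */
	for (i = 0; i < nr; i++) {
		if (pids[i] <= 0 || pids[i] >= TOY_PID_MAX)
			return -1;
		if (levels[i].used[pids[i]])
			return -1;
	}

	/* Second pass: commit the reservations. */
	for (i = 0; i < nr; i++)
		levels[i].used[pids[i]] = true;

	return 0;
}
```

A real implementation would also need locking against concurrent forks
at each level, which this sketch ignores.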
> +	if (!newsp)
> +		newsp = regs.sp;
> +	return do_fork(clone_flags, newsp, &regs, pid_nr, parent_tidptr,
> +			child_tidptr);
> +}
> +
>  /*
>   * This is trivial, and on the face of it looks like it
>   * could equally well be done in user mode.
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 085ce56..262ae1e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -10,7 +10,7 @@
>   * Fork is rather simple, once you get the hang of it, but the memory
>   * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
>   */
> -
> +#define DEBUG
>  #include <linux/slab.h>
>  #include <linux/init.h>
>  #include <linux/unistd.h>
> @@ -959,10 +959,19 @@ static struct task_struct *copy_process(unsigned long cl
>  	int retval;
>  	struct task_struct *p;
>  	int cgroup_callbacks_done = 0;
> +	pid_t clone_pid = stack_size;
Note that some architectures (e.g. ia64) actually use the stack_size
sent to copy_thread, so you at least need to zero out stack_size here.
And I suspect there are cases where a stack_size is actually sent in,
so this doesn't seem legitimate, but I haven't tracked down the callers.
If you tell me you think there should be no case where a real stack_size
is sent in, well, I'll feel better if you prove it by breaking the patch
up so that:
1. you remove the stack_size argument from copy_process.  Test such a
kernel on some architectures (boot+ltp, for instance).
2. add a chosen_pid argument to copy_process.
Then we can be sure no one is using the field.
thanks,
-serge
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13 17:27                                 ` Linus Torvalds
@ 2009-03-13 19:02                                   ` Serge E. Hallyn
       [not found]                                   ` <alpine.LFD.2.00.0903131018390.3940-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2009-03-13 20:48                                   ` Mike Waychison
  2 siblings, 0 replies; 121+ messages in thread
From: Serge E. Hallyn @ 2009-03-13 19:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sukadev Bhattiprolu, Ying Han, linux-api, containers, hpa,
	linux-kernel, Dave Hansen, linux-mm, viro, mingo, mpm,
	Andrew Morton, xemul, tglx, Alexey Dobriyan
Quoting Linus Torvalds (torvalds@linux-foundation.org):
> 
> 
> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
> 
> > Ying Han [yinghan@google.com] wrote:
> > | Hi Serge:
> > | I made a patch based on Oren's tree recently which implement a new
> > | syscall clone_with_pid. I tested with checkpoint/restart process tree
> > | and it works as expected.
> > 
> > Yes, I think we had a version of clone() with pid a while ago.
> 
> Are people _at_all_ thinking about security?
> 
> Obviously not.
> 
> There's no way we can do anything like this. Sure, it's trivial to do 
> inside the kernel. But it also sounds like a _wonderful_ attack vector 
> against badly written user-land software that sends signals and has small 
> races.
If we're worried about that, one way we could address it is to tag a
pid_ns with the userid of whoever created it, and enforce that you
can only specify a pid in a pid_ns which you own.
What OpenVZ does is have sys_restart create the whole process tree,
including custom pids (only in the new private namespaces), from inside
the kernel.  That has the same effect of only allowing specification of
pids in your own private pid namespaces.
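To make the ownership rule concrete, here is a minimal user-space
sketch, assuming each namespace records the uid of its creator.  The
struct and function names are hypothetical and this is not the kernel's
struct pid_namespace.  Refusing the initial namespace outright is an
extra assumption on my part, in the spirit of "only in the new private
namespaces".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a pid namespace. */
struct toy_pid_ns {
	unsigned int owner_uid;          /* uid of whoever created it */
	const struct toy_pid_ns *parent; /* NULL for the initial namespace */
};

/*
 * May a caller with uid 'caller_uid' pick a specific pid in 'ns'?
 * Only in a namespace it created itself, and (by assumption) never
 * in the initial namespace.
 */
bool toy_may_choose_pid(const struct toy_pid_ns *ns, unsigned int caller_uid)
{
	assert(ns != NULL);

	if (ns->parent == NULL)
		return false;

	return ns->owner_uid == caller_uid;
}
```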
> Quite frankly, from having followed the discussion(s) over the last few 
> weeks about checkpoint/restart in various forms, my reaction to just about 
> _all_ of this is that people pushing this are pretty damn borderline. 
> 
> I think you guys are working on all the wrong problems. 
> 
> Let's face it, we're not going to _ever_ checkpoint any kind of general 
> case process. Just TCP makes that fundamentally impossible in the general 
> case, and there are lots and lots of other cases too (just something as 
> totally _trivial_ as all the files in the filesystem that don't get rolled 
> back).
I'm pretty sure that each of OpenVZ, Metacluster, and Zap is able to
checkpoint, restart, and migrate tasks which are actively using TCP.
Of course doing rollback of one endpoint of an ongoing communication
would be silly, but migration is possible.
> So unless people start realizing that
>  (a) processes that want to be checkpointed had better be ready and aware 
>      of it, and help out
>  (b) there's no way in hell that we're going to add these kinds of 
>      interfaces that have dubious upsides (just teach the damn program 
>      you're checkpointing that pids will change, and admit to everybody 
>      that people who want to be checkpointed need to do work) and are 
>      potential security holes.
>  (c) if you are going to play any deeper games, you need to have 
>      privileges. IOW, "clone_with_pid()" is ok for _root_, but not for 
>      some random user. And you'd better keep that in mind EVERY SINGLE 
>      STEP OF THE WAY.
Yes, that is why we have held off on requiring privilege so far - to make
sure that at each step we have to consider security requirements.
> I'm really fed up with these discussions. I have seen almost _zero_ 
> critical thinking at all. Probably because anybody who is in the least 
> doubtful about it simply has tuned out the discussion. So here's my input: 
> start small, start over, and start thinking about other issues than just 
> checkpointing.
The first set of patches from Oren is intended to do just that.  It
certainly did not have any sort of clone_with_pid() equivalent.
-serge
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                                   ` <alpine.LFD.2.00.0903131018390.3940-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2009-03-13 19:35                                     ` Alexey Dobriyan
  2009-03-13 21:01                                       ` Linus Torvalds
  0 siblings, 1 reply; 121+ messages in thread
From: Alexey Dobriyan @ 2009-03-13 19:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sukadev Bhattiprolu, Ying Han, Serge E. Hallyn, linux-api,
	containers, hpa, linux-kernel, Dave Hansen, linux-mm, viro, mingo,
	mpm, Andrew Morton, xemul, tglx
On Fri, Mar 13, 2009 at 10:27:54AM -0700, Linus Torvalds wrote:
> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
> 
> > Ying Han [yinghan@google.com] wrote:
> > | Hi Serge:
> > | I made a patch based on Oren's tree recently which implement a new
> > | syscall clone_with_pid. I tested with checkpoint/restart process tree
> > | and it works as expected.
> > 
> > Yes, I think we had a version of clone() with pid a while ago.
> 
> Are people _at_all_ thinking about security?
> 
> Obviously not.
For the record, OpenVZ has always had a CAP_SYS_ADMIN check on restore.
And a CAP_SYS_ADMIN check will be in the version to be sent out.
Not having it is one big security hole.
> There's no way we can do anything like this. Sure, it's trivial to do 
> inside the kernel. But it also sounds like a _wonderful_ attack vector 
> against badly written user-land software that sends signals and has small 
> races.
> 
> Quite frankly, from having followed the discussion(s) over the last few 
> weeks about checkpoint/restart in various forms, my reaction to just about 
> _all_ of this is that people pushing this are pretty damn borderline. 
> 
> I think you guys are working on all the wrong problems. 
> 
> Let's face it, we're not going to _ever_ checkpoint any kind of general 
> case process. Just TCP makes that fundamentally impossible in the general 
> case, and there are lots and lots of other cases too (just something as 
> totally _trivial_ as all the files in the filesystem that don't get rolled 
> back).
What do you mean here? Unlinked files?
> So unless people start realizing that
>  (a) processes that want to be checkpointed had better be ready and aware 
>      of it, and help out
This is not going to happen. Userspace authors won't do anything
(nor should they).
>  (b) there's no way in hell that we're going to add these kinds of 
>      interfaces that have dubious upsides (just teach the damn program 
>      you're checkpointing that pids will change, and admit to everybody 
>      that people who want to be checkpointed need to do work) and are 
>      potential security holes.
I personally don't understand why on earth clone_with_pid() is with us
again.
As if pids were somehow unique among other resources.
This already came up when creating IPC objects with specific parameters
was discussed.
"struct pid" and "struct pid_namespace" can be trivially restored
without leaking to userspace.
People probably assume that a task should be restored with clone(2), which
is unnatural given the relations between task_struct, nsproxy and the
individual struct foo_namespace's.
>  (c) if you are going to play any deeper games, you need to have 
>      privileges. IOW, "clone_with_pid()" is ok for _root_, but not for 
>      some random user. And you'd better keep that in mind EVERY SINGLE 
>      STEP OF THE WAY.
> 
> I'm really fed up with these discussions. I have seen almost _zero_ 
> critical thinking at all. Probably because anybody who is in the least 
> doubtful about it simply has tuned out the discussion. So here's my input: 
> start small, start over, and start thinking about other issues than just 
> checkpointing.
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13 17:27                                 ` Linus Torvalds
  2009-03-13 19:02                                   ` Serge E. Hallyn
       [not found]                                   ` <alpine.LFD.2.00.0903131018390.3940-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2009-03-13 20:48                                   ` Mike Waychison
  2009-03-13 22:35                                     ` Oren Laadan
  2 siblings, 1 reply; 121+ messages in thread
From: Mike Waychison @ 2009-03-13 20:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sukadev Bhattiprolu, Alexey Dobriyan, linux-api, containers, mpm,
	linux-kernel, Dave Hansen, linux-mm, tglx, viro, hpa, mingo,
	Andrew Morton, xemul
Linus Torvalds wrote:
> 
> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
> 
>> Ying Han [yinghan@google.com] wrote:
>> | Hi Serge:
>> | I made a patch based on Oren's tree recently which implement a new
>> | syscall clone_with_pid. I tested with checkpoint/restart process tree
>> | and it works as expected.
>>
>> Yes, I think we had a version of clone() with pid a while ago.
> 
> Are people _at_all_ thinking about security?
> 
> Obviously not.
> 
> There's no way we can do anything like this. Sure, it's trivial to do 
> inside the kernel. But it also sounds like a _wonderful_ attack vector 
> against badly written user-land software that sends signals and has small 
> races.
I'm not really sure how this is different than a malicious app going off 
and spawning thousands of threads in an attempt to hit a target pid from 
a security pov.  Sure, it makes it easier, but it's not like there is 
anything in place to close the attack vector.
> 
> Quite frankly, from having followed the discussion(s) over the last few 
> weeks about checkpoint/restart in various forms, my reaction to just about 
> _all_ of this is that people pushing this are pretty damn borderline. 
> 
> I think you guys are working on all the wrong problems. 
> 
> Let's face it, we're not going to _ever_ checkpoint any kind of general 
> case process. Just TCP makes that fundamentally impossible in the general 
> case, and there are lots and lots of other cases too (just something as 
> totally _trivial_ as all the files in the filesystem that don't get rolled 
> back).
In some instances such as ours, TCP is probably the easiest thing to 
migrate.  In an RPC-based cluster application, TCP is nothing more than 
an RPC channel and applications already have to handle RPC channel 
failure and re-establishment.
I agree that this is not the 'general case' as you mention above 
however.  This is the bit that sorta bothers me with the way the 
implementation has been going so far on this list.  The implementation 
that folks are building on top of Oren's patchset tries to be everything 
to everybody.  For our purposes, we need to have the flexibility of 
choosing *how* we checkpoint.  The line seems to be arbitrarily drawn at 
the kernel being responsible for checkpointing and restoring all 
resources associated with a task, and leaving userland with nothing more 
than transporting filesystem bits.  This approach isn't flexible enough: 
  Consider the case where we want to stub out most of the TCP file 
descriptors with ECONNRESETed sockets because we know that they are RPC 
sockets and can re-establish themselves, but we want to use some other 
mechanism for TCP sockets we don't know much about.  The current 
monolithic approach has zero flexibility for doing anything like this, 
and I can't figure out how we could even fit anything like this in.
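To make the kind of per-socket choice I mean concrete, here is a
hypothetical user-space sketch.  None of these names exist in Oren's
patchset or anywhere else; it only models the decision "stub known RPC
sockets so their first I/O fails with ECONNRESET, hand everything else
to some other mechanism".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* What the checkpointing side knows about a TCP descriptor. */
enum toy_sock_kind { TOY_SOCK_RPC, TOY_SOCK_UNKNOWN };

/* What to do with that descriptor at restart. */
enum toy_sock_action { TOY_RESTORE_AS_RESET, TOY_RESTORE_FULL_STATE };

struct toy_sock {
	int fd;
	enum toy_sock_kind kind;
};

/* RPC sockets can re-establish themselves; anything else cannot be assumed to. */
enum toy_sock_action toy_restart_policy(const struct toy_sock *s)
{
	assert(s != NULL);

	return s->kind == TOY_SOCK_RPC ? TOY_RESTORE_AS_RESET
				       : TOY_RESTORE_FULL_STATE;
}

/*
 * Errno the application would see on its first I/O after restart:
 * ECONNRESET for a stubbed socket, 0 if the connection was carried over.
 */
int toy_first_io_errno(const struct toy_sock *s)
{
	return toy_restart_policy(s) == TOY_RESTORE_AS_RESET ? ECONNRESET : 0;
}
```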
This sort of problem is pushing me to wanting all this stuff to live up 
in userland.  The 'core dump'ish way of checkpointing is a great way to 
prototype some of the requirements, but it's going to end up being 
pretty difficult to do anything interesting long term and this is going 
to stifle any chance of this getting productized in our environments.
> 
> So unless people start realizing that
>  (a) processes that want to be checkpointed had better be ready and aware 
>      of it, and help out
>  (b) there's no way in hell that we're going to add these kinds of 
>      interfaces that have dubious upsides (just teach the damn program 
>      you're checkpointing that pids will change, and admit to everybody 
>      that people who want to be checkpointed need to do work) and are 
>      potential security holes.
This is a bit ridiculous.  This is akin to asking programs to recognize 
that their heap addresses may change.
>  (c) if you are going to play any deeper games, you need to have 
>      privileges. IOW, "clone_with_pid()" is ok for _root_, but not for 
>      some random user. And you'd better keep that in mind EVERY SINGLE 
>      STEP OF THE WAY.
> 
> I'm really fed up with these discussions. I have seen almost _zero_ 
> critical thinking at all. Probably because anybody who is in the least 
> doubtful about it simply has tuned out the discussion. So here's my input: 
> start small, start over, and start thinking about other issues than just 
> checkpointing.
> 
> 		Linus
> 
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13 19:35                                     ` Alexey Dobriyan
@ 2009-03-13 21:01                                       ` Linus Torvalds
  2009-03-13 21:51                                         ` Dave Hansen
       [not found]                                         ` <alpine.LFD.2.00.0903131401070.3940-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  0 siblings, 2 replies; 121+ messages in thread
From: Linus Torvalds @ 2009-03-13 21:01 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Sukadev Bhattiprolu, Ying Han, Serge E. Hallyn, linux-api,
	containers, hpa, linux-kernel, Dave Hansen, linux-mm, viro, mingo,
	mpm, Andrew Morton, xemul, tglx
On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
> > 
> > Let's face it, we're not going to _ever_ checkpoint any kind of general 
> > case process. Just TCP makes that fundamentally impossible in the general 
> > case, and there are lots and lots of other cases too (just something as 
> > totally _trivial_ as all the files in the filesystem that don't get rolled 
> > back).
> 
> What do you mean here? Unlinked files?
Or modified files, or anything else. "External state" is a pretty damn 
wide net. It's not just TCP sequence numbers and another machine.
		Linus
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13 21:01                                       ` Linus Torvalds
@ 2009-03-13 21:51                                         ` Dave Hansen
  2009-03-13 22:15                                           ` Oren Laadan
       [not found]                                         ` <alpine.LFD.2.00.0903131401070.3940-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  1 sibling, 1 reply; 121+ messages in thread
From: Dave Hansen @ 2009-03-13 21:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexey Dobriyan, Sukadev Bhattiprolu, Ying Han, Serge E. Hallyn,
	linux-api, containers, hpa, linux-kernel, linux-mm, viro, mingo,
	mpm, Andrew Morton, xemul, tglx
On Fri, 2009-03-13 at 14:01 -0700, Linus Torvalds wrote:
> On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
> > > Let's face it, we're not going to _ever_ checkpoint any kind of general 
> > > case process. Just TCP makes that fundamentally impossible in the general 
> > > case, and there are lots and lots of other cases too (just something as 
> > > totally _trivial_ as all the files in the filesystem that don't get rolled 
> > > back).
> > 
> > What do you mean here? Unlinked files?
> 
> Or modified files, or anything else. "External state" is a pretty damn 
> wide net. It's not just TCP sequence numbers and another machine.
This is precisely the reason that we've focused so hard on containers,
and *didn't* just jump right into checkpoint/restart; we're trying
really hard to constrain the _truly_ external things that a process can
interact with.  
The approach so far has largely been to make things that are external to a
process at least *internal* to a container.  Network, pid, ipc, and uts
namespaces, for example.  An ipc/sem.c semaphore may be external to a
process, so we'll just pick the whole namespace up and checkpoint it
along with the process.
In the OpenVZ case, they've at least demonstrated that the filesystem
can be moved largely with rsync.  Unlinked files need some in-kernel TLC
(or /proc mangling) but it isn't *that* bad.
We can also make the fs problem much easier by using things like dm or
btrfs snapshotting of the block device, or restricting to where on a fs
a container is allowed to write with stuff like r/o bind mounts.
-- Dave
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13 21:51                                         ` Dave Hansen
@ 2009-03-13 22:15                                           ` Oren Laadan
  2009-03-14  0:27                                             ` Eric W. Biederman
  0 siblings, 1 reply; 121+ messages in thread
From: Oren Laadan @ 2009-03-13 22:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Andrew Morton, linux-api, containers, mpm,
	linux-kernel, linux-mm, tglx, viro, hpa, mingo,
	Sukadev Bhattiprolu, Alexey Dobriyan, xemul
Dave Hansen wrote:
> On Fri, 2009-03-13 at 14:01 -0700, Linus Torvalds wrote:
>> On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
>>>> Let's face it, we're not going to _ever_ checkpoint any kind of general 
>>>> case process. Just TCP makes that fundamentally impossible in the general 
>>>> case, and there are lots and lots of other cases too (just something as 
>>>> totally _trivial_ as all the files in the filesystem that don't get rolled 
>>>> back).
>>> What do you mean here? Unlinked files?
>> Or modified files, or anything else. "External state" is a pretty damn 
>> wide net. It's not just TCP sequence numbers and another machine.
> 
> This is precisely the reason that we've focused so hard on containers,
> and *didn't* just jump right into checkpoint/restart; we're trying
> really hard to constrain the _truly_ external things that a process can
> interact with.  
> 
> The approach so far has largely been to make things that are external to a
> process at least *internal* to a container.  Network, pid, ipc, and uts
> namespaces, for example.  An ipc/sem.c semaphore may be external to a
> process, so we'll just pick the whole namespace up and checkpoint it
> along with the process.
> 
> In the OpenVZ case, they've at least demonstrated that the filesystem
> can be moved largely with rsync.  Unlinked files need some in-kernel TLC
> (or /proc mangling) but it isn't *that* bad.
And in Zap we have successfully used a log-based filesystem
(specifically NILFS) to continuously snapshot the file-system atomically
with taking a checkpoint, so it can easily branch off past checkpoints,
including the file system.
And unlinked files can be (inefficiently) handled by saving their full
contents with the checkpoint image - it's not a big toll on many apps
(if you exclude Wine and UML...). At least that's a start.
> 
> We can also make the fs problem much easier by using things like dm or
> btrfs snapshotting of the block device, or restricting to where on a fs
> a container is allowed to write with stuff like r/o bind mounts.
(or NILFS)
So we argue that the FS snapshotting is related, but orthogonal in terms
of implementation to c/r.
Oren.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13 20:48                                   ` Mike Waychison
@ 2009-03-13 22:35                                     ` Oren Laadan
  2009-03-18 18:54                                       ` Mike Waychison
  0 siblings, 1 reply; 121+ messages in thread
From: Oren Laadan @ 2009-03-13 22:35 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Linus Torvalds, Andrew Morton, linux-api, containers, hpa,
	linux-kernel, Dave Hansen, linux-mm, viro, mingo, mpm, tglx,
	Sukadev Bhattiprolu, Alexey Dobriyan, xemul
Mike Waychison wrote:
> Linus Torvalds wrote:
>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>
>>> Ying Han [yinghan@google.com] wrote:
>>> | Hi Serge:
>>> | I made a patch based on Oren's tree recently which implement a new
>>> | syscall clone_with_pid. I tested with checkpoint/restart process tree
>>> | and it works as expected.
>>>
>>> Yes, I think we had a version of clone() with pid a while ago.
>> Are people _at_all_ thinking about security?
>>
>> Obviously not.
>>
>> There's no way we can do anything like this. Sure, it's trivial to do 
>> inside the kernel. But it also sounds like a _wonderful_ attack vector 
>> against badly written user-land software that sends signals and has small 
>> races.
> 
> I'm not really sure how this is different than a malicious app going off 
> and spawning thousands of threads in an attempt to hit a target pid from 
> a security pov.  Sure, it makes it easier, but it's not like there is 
> anything in place to close the attack vector.
> 
>> Quite frankly, from having followed the discussion(s) over the last few 
>> weeks about checkpoint/restart in various forms, my reaction to just about 
>> _all_ of this is that people pushing this are pretty damn borderline. 
>>
>> I think you guys are working on all the wrong problems. 
>>
>> Let's face it, we're not going to _ever_ checkpoint any kind of general 
>> case process. Just TCP makes that fundamentally impossible in the general 
>> case, and there are lots and lots of other cases too (just something as 
>> totally _trivial_ as all the files in the filesystem that don't get rolled 
>> back).
> 
> In some instances such as ours, TCP is probably the easiest thing to 
> migrate.  In an rpc-based cluster application, TCP is nothing more than 
> an RPC channel and applications already have to handle RPC channel 
> failure and re-establishment.
> 
> I agree that this is not the 'general case' as you mention above 
> however.  This is the bit that sorta bothers me with the way the 
> implementation has been going so far on this list.  The implementation 
> that folks are building on top of Oren's patchset tries to be everything 
> to everybody.  For our purposes, we need to have the flexibility of 
> choosing *how* we checkpoint.  The line seems to be arbitrarily drawn at 
> the kernel being responsible for checkpointing and restoring all 
> resources associated with a task, and leaving userland with nothing more 
> than transporting filesystem bits.  This approach isn't flexible enough: 
>   Consider the case where we want to stub out most of the TCP file 
> descriptors with ECONNRESETed sockets because we know that they are RPC 
> sockets and can re-establish themselves, but we want to use some other 
> mechanism for TCP sockets we don't know much about.  The current 
> monolithic approach has zero flexibility for doing anything like this, 
> and I can't figure out how we could even fit anything like this in.
The flexibility exists, but wasn't spelled out, so here it is:
1) Similar to madvise(), I envision a cradvice() that could tell the c/r
something about specific resources, e.g.:
 * cradvice(CR_ADV_MEM, ptr, len)  -> don't save that memory, it's scratch
 * cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET)  -> reset connection on restart
etc. (never mind the exact interface right now)
2) Tasks can ask to be notified (e.g. register a signal) when a checkpoint
or a restart completes successfully. At that time they can do their private
house-keeping if they know better.
3) If restoring some resource is significantly easier in user space (e.g. a
file-descriptor of some special device which user space knows how to
re-initialize), then the restarting task can prepare it ahead of time,
and call:
  * cradvice(CR_ADV_USERFD, fd, 0)  -> use the fd in place instead of trying
				       to restore it yourself.
Method #3 is what I used in Zap to implement distributed checkpoints, where
it is so much easier to recreate all network connections in user space than
to put that logic into the kernel.
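For illustration only, the per-descriptor bookkeeping behind such advice
could look roughly like the user-space toy below.  cradvice() itself does
not exist; the TOY_* names, the table layout and the fd limit are all
made up here, and the sketch covers only the socket-reset and user-fd
cases from the list above.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FD 32 /* hypothetical size of the descriptor table */

/* Advice recorded for one descriptor; NONE means "restore it normally". */
enum toy_cr_advice {
	TOY_ADV_NONE = 0,
	TOY_ADV_SOCK_RESET, /* come back as a reset connection */
	TOY_ADV_USERFD      /* user space already prepared this fd */
};

struct toy_cr_table {
	enum toy_cr_advice per_fd[TOY_MAX_FD];
};

/* Record advice for 'fd'.  Returns 0 on success, -1 on a bad descriptor. */
int toy_cradvice(struct toy_cr_table *t, int fd, enum toy_cr_advice adv)
{
	assert(t != NULL);

	if (fd < 0 || fd >= TOY_MAX_FD)
		return -1;

	t->per_fd[fd] = adv;
	return 0;
}

/*
 * Should the restart code rebuild the full state of 'fd' itself?
 * Returns 1 if no advice was given, 0 if it should use the reset stub
 * or the user-provided descriptor instead (or if 'fd' is out of range).
 */
int toy_restart_restores_fd(const struct toy_cr_table *t, int fd)
{
	assert(t != NULL);

	if (fd < 0 || fd >= TOY_MAX_FD)
		return 0;

	return t->per_fd[fd] == TOY_ADV_NONE;
}
```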
Now, on the other hand, doing the c/r from userland is much less flexible
than in the kernel (e.g. epollfd, futex state and much more) and requires
exposing a tremendous amount of in-kernel data to user space. And we all know
that exposing internals is always a one-way ticket :(
[...]
Oren.
^ permalink raw reply	[flat|nested] 121+ messages in thread
* Re: What can OpenVZ do?
  2009-02-13 22:28                   ` Alexey Dobriyan
@ 2009-03-14  0:04                     ` Eric W. Biederman
  2009-03-14  0:26                       ` Serge E. Hallyn
  0 siblings, 1 reply; 121+ messages in thread
From: Eric W. Biederman @ 2009-03-14  0:04 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Ingo Molnar, linux-api, containers, hpa, linux-kernel,
	Dave Hansen, linux-mm, viro, Matt Mackall, Andrew Morton,
	torvalds, tglx, Pavel Emelyanov
Alexey Dobriyan <adobriyan@gmail.com> writes:
> On Fri, Feb 13, 2009 at 12:45:03PM +0100, Ingo Molnar wrote:
>> 
>> * Alexey Dobriyan <adobriyan@gmail.com> wrote:
>> 
>> > On Fri, Feb 13, 2009 at 11:27:32AM +0100, Ingo Molnar wrote:
>> > > Merging checkpoints instead might give them the incentive to get
>> > > their act together.
>> > 
>> > Knowing how much time it takes to beat CPT back into usable shape every time
>> > big kernel rebase is done, OpenVZ/Virtuozzo have every single damn incentive
>> > to have CPT mainlined.
>> 
>> So where is the bottleneck? I suspect the effort in having forward ported
>> it across 4 major kernel releases in a single year is already larger than
>> the technical effort it would  take to upstream it. Any unreasonable upstream 
>> resistence/passivity you are bumping into?
>
> People were busy with netns/containers stuff and OpenVZ/Virtuozzo bugs.
Yes.  Getting the namespaces, particularly the network namespace, finished
has consumed a lot of work.
Then we have a bunch of people helping with ill-conceived patches that seem
to wear out the patience of people upstream.  Al, Greg KH, Linus.
The whole recent resurrection of the question of whether we should have a
clone with pid syscall.
Eric
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                                         ` <alpine.LFD.2.00.0903131401070.3940-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2009-03-14  0:20                                           ` Alexey Dobriyan
  2009-03-14  8:25                                             ` Ingo Molnar
  0 siblings, 1 reply; 121+ messages in thread
From: Alexey Dobriyan @ 2009-03-14  0:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Sukadev Bhattiprolu, Ying Han, Serge E. Hallyn,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mingo-X9Un+BFzKDI,
	mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ, Andrew Morton,
	xemul-GEFAQzZX7r8dnm+yROfE0A, tglx-hfZtesqFncYOwBW4kG4KsQ
On Fri, Mar 13, 2009 at 02:01:50PM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
> > > 
> > > Let's face it, we're not going to _ever_ checkpoint any kind of general 
> > > case process. Just TCP makes that fundamentally impossible in the general 
> > > case, and there are lots and lots of other cases too (just something as 
> > > totally _trivial_ as all the files in the filesystem that don't get rolled 
> > > back).
> > 
> > What do you mean here? Unlinked files?
> 
> Or modified files, or anything else. "External state" is a pretty damn 
> wide net. It's not just TCP sequence numbers and another machine.
I think (I think) you're seriously underestimating what's doable with
kernel C/R and what's already done.
I was told (haven't seen it myself) that Oracle installations and
Counter Strike servers were moved between boxes just fine.
They were run in a specially prepared environment of course, but still.
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Re: What can OpenVZ do?
  2009-03-14  0:04                     ` Eric W. Biederman
@ 2009-03-14  0:26                       ` Serge E. Hallyn
  0 siblings, 0 replies; 121+ messages in thread
From: Serge E. Hallyn @ 2009-03-14  0:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexey Dobriyan, linux-api, containers, Matt Mackall,
	linux-kernel, Dave Hansen, linux-mm, tglx, viro, hpa, Ingo Molnar,
	torvalds, Andrew Morton, Pavel Emelyanov
Quoting Eric W. Biederman (ebiederm@xmission.com):
> Alexey Dobriyan <adobriyan@gmail.com> writes:
> 
> > On Fri, Feb 13, 2009 at 12:45:03PM +0100, Ingo Molnar wrote:
> >> 
> >> * Alexey Dobriyan <adobriyan@gmail.com> wrote:
> >> 
> >> > On Fri, Feb 13, 2009 at 11:27:32AM +0100, Ingo Molnar wrote:
> >> > > Merging checkpoints instead might give them the incentive to get
> >> > > their act together.
> >> > 
> >> > Knowing how much time it takes to beat CPT back into usable shape every time
> >> > big kernel rebase is done, OpenVZ/Virtuozzo have every single damn incentive
> >> > to have CPT mainlined.
> >> 
> >> So where is the bottleneck? I suspect the effort in having forward ported
> >> it across 4 major kernel releases in a single year is already larger than
> >> the technical effort it would  take to upstream it. Any unreasonable upstream 
> >> resistence/passivity you are bumping into?
> >
> > People were busy with netns/containers stuff and OpenVZ/Virtuozzo bugs.
> 
> Yes.  Getting the namespaces particularly the network namespace finished
> has consumed a lot of work.
> 
> Then we have a bunch of people helping with ill conceived patches that seem
> to wear out the patience of people upstream.  Al, Greg kh, Linus.
> 
> The whole recent ressurection of the question of we should have a clone
> with pid syscall.
/me points
Alexey started it :)
But, Linus asks to start with simple checkpoint/restart patches.  Oren's
basic patchset pretty much does that, though, right?  Patches 1-7 just
do a basic single task.  8-10 add simple open files.  11, 13 and 14 do
external checkpoint and multiple tasks.
Are these an ok place to start, or do these need to be simplified even
more?
-serge
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13 22:15                                           ` Oren Laadan
@ 2009-03-14  0:27                                             ` Eric W. Biederman
  2009-03-14  8:12                                               ` Ingo Molnar
  0 siblings, 1 reply; 121+ messages in thread
From: Eric W. Biederman @ 2009-03-14  0:27 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Dave Hansen, linux-api, containers, hpa, linux-kernel,
	Alexey Dobriyan, linux-mm, viro, mingo, mpm, Andrew Morton,
	Sukadev Bhattiprolu, Linus Torvalds, tglx, xemul
Oren Laadan <orenl@cs.columbia.edu> writes:
> Dave Hansen wrote:
>> On Fri, 2009-03-13 at 14:01 -0700, Linus Torvalds wrote:
>>> On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
>>>>> Let's face it, we're not going to _ever_ checkpoint any kind of general 
>>>>> case process. Just TCP makes that fundamentally impossible in the general 
>>>>> case, and there are lots and lots of other cases too (just something as 
>>>>> totally _trivial_ as all the files in the filesystem that don't get rolled 
>>>>> back).
>>>> What do you mean here? Unlinked files?
>>> Or modified files, or anything else. "External state" is a pretty damn 
>>> wide net. It's not just TCP sequence numbers and another machine.
>> 
>> This is precisely the reason that we've focused so hard on containers,
>> and *didn't* just jump right into checkpoint/restart; we're trying
>> really hard to constrain the _truly_ external things that a process can
>> interact with.  
>> 
>> The approach so far has largely been to make things are external to a
>> process at least *internal* to a container.  Network, pid, ipc, and uts
>> namespaces, for example.  An ipc/sem.c semaphore may be external to a
>> process, so we'll just pick the whole namespace up and checkpoint it
>> along with the process.
>> 
>> In the OpenVZ case, they've at least demonstrated that the filesystem
>> can be moved largely with rsync.  Unlinked files need some in-kernel TLC
>> (or /proc mangling) but it isn't *that* bad.
>
> And in the Zap we have successfully used a log-based filesystem
> (specifically NILFS) to continuously snapshot the file-system atomically
> with taking a checkpoint, so it can easily branch off past checkpoints,
> including the file system.
>
> And unlinked files can be (inefficiently) handled by saving their full
> contents with the checkpoint image - it's not a big toll on many apps
> (if you exclude Wine and UML...). At least that's a start.
Oren, we might want to do a proof-of-concept implementation like I did
with network namespaces: one that is done in the community and goes far
enough to show we don't have horribly nasty code.  The patches and
individual changes don't need to be quite perfect, but close enough
that they can be considered for merging.
For the network namespace that seems to have made a big difference.
I'm afraid in our clean start we may have focused a little too much
on merging something simple and not gone far enough on showing that
things will work.
After I had that in the network namespace and we had a clear vision of
the direction, we started merging the individual patches and things
went well.
Eric
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-14  0:27                                             ` Eric W. Biederman
@ 2009-03-14  8:12                                               ` Ingo Molnar
  2009-03-16 22:33                                                 ` Kevin Fox
  2009-03-19 21:19                                                 ` Eric W. Biederman
  0 siblings, 2 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-03-14  8:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oren Laadan, Dave Hansen, linux-api, containers, hpa,
	linux-kernel, Alexey Dobriyan, linux-mm, viro, mpm, Andrew Morton,
	Sukadev Bhattiprolu, Linus Torvalds, tglx, xemul
* Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> In the OpenVZ case, they've at least demonstrated that the 
> >> filesystem can be moved largely with rsync.  Unlinked files 
> >> need some in-kernel TLC (or /proc mangling) but it isn't 
> >> *that* bad.
> >
> > And in the Zap we have successfully used a log-based 
> > filesystem (specifically NILFS) to continuously snapshot the 
> > file-system atomically with taking a checkpoint, so it can 
> > easily branch off past checkpoints, including the file 
> > system.
> >
> > And unlinked files can be (inefficiently) handled by saving 
> > their full contents with the checkpoint image - it's not a 
> > big toll on many apps (if you exclude Wine and UML...). At 
> > least that's a start.
> 
> Oren we might want to do a proof of concept implementation 
> like I did with network namespaces.  That is done in the 
> community and goes far enough to show we don't have horribly 
> nasty code.  The patches and individual changes don't need to 
> be quite perfect but close enough that they can be considered 
> for merging.
> 
> For the network namespace that seems to have made a big 
> difference.
> 
> I'm afraid in our clean start we may have focused a little too 
> much on merging something simple and not gone far enough on 
> showing that things will work.
> 
> After I had that in the network namespace and we had a clear 
> vision of the direction.  We started merging the individual 
> patches and things went well.
I'm curious: what is the actual end result other than good 
looking code? In terms of tangible benefits to the everyday 
Linux distro user. [This is not meant to be sarcastic, i'm
truly curious.]
	Ingo
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-14  0:20                                           ` Alexey Dobriyan
@ 2009-03-14  8:25                                             ` Ingo Molnar
       [not found]                                               ` <20090314082532.GB16436-X9Un+BFzKDI@public.gmane.org>
  2009-03-16  6:01                                               ` Oren Laadan
  0 siblings, 2 replies; 121+ messages in thread
From: Ingo Molnar @ 2009-03-14  8:25 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Linus Torvalds, Sukadev Bhattiprolu, Ying Han, Serge E. Hallyn,
	linux-api, containers, hpa, linux-kernel, Dave Hansen, linux-mm,
	viro, mpm, Andrew Morton, xemul, tglx
* Alexey Dobriyan <adobriyan@gmail.com> wrote:
> On Fri, Mar 13, 2009 at 02:01:50PM -0700, Linus Torvalds wrote:
> > 
> > 
> > On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
> > > > 
> > > > Let's face it, we're not going to _ever_ checkpoint any 
> > > > kind of general case process. Just TCP makes that 
> > > > fundamentally impossible in the general case, and there 
> > > > are lots and lots of other cases too (just something as 
> > > > totally _trivial_ as all the files in the filesystem 
> > > > that don't get rolled back).
> > > 
> > > What do you mean here? Unlinked files?
> > 
> > Or modified files, or anything else. "External state" is a 
> > pretty damn wide net. It's not just TCP sequence numbers and 
> > another machine.
> 
> I think (I think) you're seriously underestimating what's 
> doable with kernel C/R and what's already done.
> 
> I was told (haven't seen it myself) that Oracle installations 
> and Counter Strike servers were moved between boxes just fine.
> 
> They were run in specially prepared environment of course, but 
> still.
That's the kind of stuff i'd like to see happen.
Right now the main 'enterprise' approach to do 
migration/consolidation of server contexts is based on hardware 
virtualization - but that pushes runtime overhead to the native 
kernel and slows down the guest context as well - massively so.
Before we've blinked twice it will be a 'required' enterprise 
feature and enterprise people will measure/benchmark Linux 
server performance in guest context primarily and we'll have a 
deep performance pit to dig ourselves out of.
We can ignore that trend as uninteresting (it is uninteresting 
in a number of ways because it is partly driven by stupidity), 
or we can do something about it while still advancing the 
kernel.
With containers+checkpointing the code is a lot scarier (we 
basically do system call virtualization), the environment 
interactions are a lot wider and thus they are a lot more 
difficult to handle - but it's all a lot faster as well, and 
conceptually so. All the runtime overhead is pushed to the 
checkpointing step - (with some minimal amount of data structure 
isolation overhead).
I see three conceptual levels of virtualization:
 - hardware based virtualization, for 'unaware OSs'
 - system call based virtualization, for 'unaware software'
 - no virtualization kernel help is needed _at all_ to 
   checkpoint 'aware' software. We have libraries to checkpoint 
   'aware' user-space just fine - and had them for a decade.
	Ingo
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
       [not found]                                               ` <20090314082532.GB16436-X9Un+BFzKDI@public.gmane.org>
@ 2009-03-14 17:11                                                 ` Joseph Ruscio
  0 siblings, 0 replies; 121+ messages in thread
From: Joseph Ruscio @ 2009-03-14 17:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	tglx-hfZtesqFncYOwBW4kG4KsQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	Andrew Morton, Sukadev Bhattiprolu, Linus Torvalds,
	Alexey Dobriyan, xemul-GEFAQzZX7r8dnm+yROfE0A
On Mar 14, 2009, at 1:25 AM, Ingo Molnar wrote:
> * Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
>> On Fri, Mar 13, 2009 at 02:01:50PM -0700, Linus Torvalds wrote:
>>>
>>>
>>> On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
>>>>>
>>>>> Let's face it, we're not going to _ever_ checkpoint any
>>>>> kind of general case process. Just TCP makes that
>>>>> fundamentally impossible in the general case, and there
>>>>> are lots and lots of other cases too (just something as
>>>>> totally _trivial_ as all the files in the filesystem
>>>>> that don't get rolled back).
>>>>
>>>> What do you mean here? Unlinked files?
>>>
>>> Or modified files, or anything else. "External state" is a
>>> pretty damn wide net. It's not just TCP sequence numbers and
>>> another machine.
>>
>> I think (I think) you're seriously underestimating what's
>> doable with kernel C/R and what's already done.
>>
>> I was told (haven't seen it myself) that Oracle installations
>> and Counter Strike servers were moved between boxes just fine.
>>
>> They were run in specially prepared environment of course, but
>> still.
>
> That's the kind of stuff i'd like to see happen.
>
> Right now the main 'enterprise' approach to do
> migration/consolidation of server contexts is based on hardware
> virtualization - but that pushes runtime overhead to the native
> kernel and slows down the guest context as well - massively so.
>
> Before we've blinked twice it will be a 'required' enterprise
> feature and enterprise people will measure/benchmark Linux
> server performance in guest context primarily and we'll have a
> deep performance pit to dig ourselves out of.
>
> We can ignore that trend as uninteresting (it is uninteresting
> in a number of ways because it is partly driven by stupidity),
> or we can do something about it while still advancing the
> kernel.
I'd tend to echo these comments. I don't think you can overestimate
how many workloads are stuck in VMs (or under consideration for such)
mainly in order to containerize them and make them mobile. Right now
VMs are the only hammer, so every virtualization scenario looks like
a nail. As an extreme example, some of the National Labs are
experimenting with VMs to checkpoint long-running jobs or live-migrate
a part of a job off a machine throwing hardware errors (soon
to fail). They're trying this approach even though VMs can add a
significant overhead (in the I/O path), typically considered the third
rail in HPC.
KVM is a step in the right direction, because we can now locate some
number of VMs with a native workload, but the OpenVZ guys have shown
that you can achieve much higher densities with an OS Virtualization
container approach.
Joe
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-14  8:25                                             ` Ingo Molnar
       [not found]                                               ` <20090314082532.GB16436-X9Un+BFzKDI@public.gmane.org>
@ 2009-03-16  6:01                                               ` Oren Laadan
  1 sibling, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-03-16  6:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexey Dobriyan, linux-api, containers, mpm, linux-kernel,
	Dave Hansen, linux-mm, viro, hpa, Andrew Morton,
	Sukadev Bhattiprolu, Linus Torvalds, tglx, xemul
Ingo Molnar wrote:
> * Alexey Dobriyan <adobriyan@gmail.com> wrote:
> 
>> On Fri, Mar 13, 2009 at 02:01:50PM -0700, Linus Torvalds wrote:
>>>
>>> On Fri, 13 Mar 2009, Alexey Dobriyan wrote:
>>>>> Let's face it, we're not going to _ever_ checkpoint any 
>>>>> kind of general case process. Just TCP makes that 
>>>>> fundamentally impossible in the general case, and there 
>>>>> are lots and lots of other cases too (just something as 
>>>>> totally _trivial_ as all the files in the filesystem 
>>>>> that don't get rolled back).
>>>> What do you mean here? Unlinked files?
>>> Or modified files, or anything else. "External state" is a 
>>> pretty damn wide net. It's not just TCP sequence numbers and 
>>> another machine.
>> I think (I think) you're seriously underestimating what's 
>> doable with kernel C/R and what's already done.
>>
>> I was told (haven't seen it myself) that Oracle installations 
>> and Counter Strike servers were moved between boxes just fine.
>>
>> They were run in specially prepared environment of course, but 
>> still.
> 
> That's the kind of stuff i'd like to see happen.
> 
> Right now the main 'enterprise' approach to do 
> migration/consolidation of server contexts is based on hardware 
> virtualization - but that pushes runtime overhead to the native 
> kernel and slows down the guest context as well - massively so.
> 
> Before we've blinked twice it will be a 'required' enterprise 
> feature and enterprise people will measure/benchmark Linux 
> server performance in guest context primarily and we'll have a 
> deep performance pit to dig ourselves out of.
> 
> We can ignore that trend as uninteresting (it is uninteresting 
> in a number of ways because it is partly driven by stupidity), 
> or we can do something about it while still advancing the 
> kernel.
> 
> With containers+checkpointing the code is a lot scarier (we 
> basically do system call virtualization), the environment 
> interactions are a lot wider and thus they are a lot more 
> difficult to handle - but it's all a lot faster as well, and 
> conceptually so. All the runtime overhead is pushed to the 
> checkpointing step - (with some minimal amount of data structure 
> isolation overhead).
It's worthwhile to make the distinction between virtualization and
checkpoint/restart (c/r). Virtualization is about decoupling the
applications from the underlying operating system by providing a
private and virtual namespace, that is, containers. Checkpoint/restart
is the ability to save the state of a container so that it can
be restarted later from that point.
The point is that virtualization is *already* part of the kernel
through namespaces (pid, ipc, mounts, etc). This considerable body
of work was eventually merged and is mostly complete, covering most
of the environment interactions. The runtime overhead is negligible.
Seeing that namespaces are now part of the kernel, we now build on
the existing virtualization to allow checkpoint/restart. The code is
not at all scary: record the state on checkpoint, and restore it on
restart. There is no runtime overhead for checkpoint other than the
downtime incurred on an application when it is frozen for the duration
of the checkpoint.
> 
> I see three conceptual levels of virtualization:
> 
>  - hardware based virtualization, for 'unaware OSs'
> 
>  - system call based virtualization, for 'unaware software'
> 
>  - no virtualization kernel help is needed _at all_ to 
>    checkpoint 'aware' software. We have libraries to checkpoint 
>    'aware' user-space just fine - and had them for a decade.
Checkpoint/restart is almost orthogonal to virtualization (c/r only
needs a way to request a specific resource identifier for resources
that it creates). Therefore, the effort required to allow c/r of
'aware' software is nearly the same as for 'unaware' software.
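To illustrate what "request a specific resource identifier" means, here is a minimal userspace sketch, assuming a tiny fixed-size identifier table. It is not code from the patchset and not a kernel interface; `id_space`, `id_alloc_any` and `id_alloc_at` are hypothetical names. The normal path hands out any free identifier, while the restart path asks for exactly the one recorded in the checkpoint image.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ID_MAX 64	/* arbitrary table size for the sketch */

/* Hypothetical per-namespace identifier table; id 0 is never handed out. */
struct id_space {
	bool used[ID_MAX];
};

/* Normal path: hand out the lowest free identifier, or -1 if none is left. */
int id_alloc_any(struct id_space *s)
{
	assert(s != NULL);
	for (int id = 1; id < ID_MAX; id++) {
		if (!s->used[id]) {
			s->used[id] = true;
			return id;
		}
	}
	return -1;
}

/*
 * Restart path: claim exactly the identifier saved in the checkpoint image.
 * Returns the id, or -1 if it is out of range or already taken.
 */
int id_alloc_at(struct id_space *s, int want)
{
	assert(s != NULL);
	if (want < 1 || want >= ID_MAX || s->used[want])
		return -1;
	s->used[want] = true;
	return want;
}
```

Because the table is private to one namespace in this sketch, a restart into a fresh namespace would find every recorded identifier free.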
IMHO this is the natural next step: make the c/r useful and attractive
by making it transparent (support 'unaware' software), complete (cover
nearly all features) and efficient (with low application downtime).
And this is precisely what we aim for with the current patchset.
Oren.
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-14  8:12                                               ` Ingo Molnar
@ 2009-03-16 22:33                                                 ` Kevin Fox
  2009-03-19 21:19                                                 ` Eric W. Biederman
  1 sibling, 0 replies; 121+ messages in thread
From: Kevin Fox @ 2009-03-16 22:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric W. Biederman, containers, hpa, linux-kernel, Dave Hansen,
	mpm, linux-mm, tglx, viro, linux-api, Andrew Morton,
	Sukadev Bhattiprolu, Linus Torvalds, Alexey Dobriyan, xemul
On Sat, 2009-03-14 at 01:12 -0700, Ingo Molnar wrote:
> 
> * Eric W. Biederman <ebiederm@xmission.com> wrote:
> 
> > >> In the OpenVZ case, they've at least demonstrated that the
> > >> filesystem can be moved largely with rsync.  Unlinked files
> > >> need some in-kernel TLC (or /proc mangling) but it isn't
> > >> *that* bad.
> > >
> > > And in the Zap we have successfully used a log-based
> > > filesystem (specifically NILFS) to continuously snapshot the
> > > file-system atomically with taking a checkpoint, so it can
> > > easily branch off past checkpoints, including the file
> > > system.
> > >
> > > And unlinked files can be (inefficiently) handled by saving
> > > their full contents with the checkpoint image - it's not a
> > > big toll on many apps (if you exclude Wine and UML...). At
> > > least that's a start.
> >
> > Oren we might want to do a proof of concept implementation
> > like I did with network namespaces.  That is done in the
> > community and goes far enough to show we don't have horribly
> > nasty code.  The patches and individual changes don't need to
> > be quite perfect but close enough that they can be considered
> > for merging.
> >
> > For the network namespace that seems to have made a big
> > difference.
> >
> > I'm afraid in our clean start we may have focused a little too
> > much on merging something simple and not gone far enough on
> > showing that things will work.
> >
> > After I had that in the network namespace and we had a clear
> > vision of the direction.  We started merging the individual
> > patches and things went well.
> 
> I'm curious: what is the actual end result other than good
> looking code? In terms of tangible benefits to the everyday
> Linux distro user. [This is not meant to be sarcastic, i'm
> truly curious.]
From an ordinary user's perspective, I hate losing my desktop state every
time there is a power bump or a new kernel/video driver comes down from
the distro provider. Some of the stuff I lose:
*All my terminals
    *many tabs and windows
    *each in a different directory
    *vi
       *which files I was editing
       *which function I was coding
    *screen
    *scrollback buffer's contents
         *history for debugging code
    *command line arguments
*State of running apps
    *web browser
        *Tabs: yes, it saves URLs on crash, but sometimes the page can't
         come back up (say, because of a form)
        *where the windows are on the desktop
    *evolution
        *what folder is selected
        *which message within the folder is selected
    *rhythmbox
    *misc other apps
Being able to reboot and get back to exactly where I was before the
reboot would save me a lot of time restarting apps and getting my
desktop back to where it was before the reboot. I'd also be more
inclined to reboot to get security updates more frequently if I didn't
lose track of what I was doing in the session, making machines more
secure in the process.
Kevin
PS: Yes, I know both GNOME and KDE have tried to deal with some of this
with their session manager stuff, but it doesn't restore everything and
is only supported by some apps. It would probably take more work to get all
apps working with the session management stuff than supporting kernel
C/R.
* Re: [RFC v13][PATCH 05/14] x86 support for checkpoint/restart
  2009-02-24  7:47   ` Nathan Lynch
       [not found]     ` <20090224014739.1b82fc35-4v5LP+xe+1byhTdZtsIeww@public.gmane.org>
@ 2009-03-18  7:21     ` Oren Laadan
  1 sibling, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-03-18  7:21 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Andrew Morton, linux-api, containers, linux-kernel, Dave Hansen,
	linux-mm, Linus Torvalds, Alexander Viro, H. Peter Anvin,
	Thomas Gleixner, Ingo Molnar
Nathan Lynch wrote:
> Hi, this is an old thread I guess, but I just noticed some issues while
> looking at this code.
> 
> On Tue, 27 Jan 2009 12:08:03 -0500
> Oren Laadan <orenl@cs.columbia.edu> wrote:
>> +static int cr_read_cpu_fpu(struct cr_ctx *ctx, struct task_struct *t)
>> +{
>> +	void *xstate_buf = cr_hbuf_get(ctx, xstate_size);
>> +	int ret;
>> +
>> +	ret = cr_kread(ctx, xstate_buf, xstate_size);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	/* i387 + MMU + SSE */
>> +	preempt_disable();
>> +
>> +	/* init_fpu() also calls set_used_math() */
>> +	ret = init_fpu(current);
>> +	if (ret < 0)
>> +		return ret;
> 
> Several problems here:
> * init_fpu can call kmem_cache_alloc(GFP_KERNEL), but is called here
>   with preempt disabled (init_fpu could use a might_sleep annotation?)
> * if init_fpu returns an error, we get preempt imbalance
> * if init_fpu returns an error, we "leak" the cr_hbuf_get for
>   xstate_buf
Fixed, thanks.
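For readers following along, the three problems above share one fix pattern: do anything that can sleep before disabling preemption, and leave through a single exit path that undoes whatever was taken. What follows is a minimal userspace model of that ordering, not the actual fix that went into the patchset. Every `model_*` name and both counters are hypothetical and only stand in for `preempt_disable()`, `init_fpu()` and `cr_hbuf_get()`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for kernel state; these are not kernel APIs. */
static int model_preempt_depth;	/* models preempt_disable() nesting */
static int model_hbuf_in_use;	/* models outstanding cr_hbuf_get() buffers */

static void *model_hbuf_get(size_t len)
{
	void *p = malloc(len);

	if (p)
		model_hbuf_in_use++;
	return p;
}

static void model_hbuf_put(void *p)
{
	if (p) {
		free(p);
		model_hbuf_in_use--;
		assert(model_hbuf_in_use >= 0);
	}
}

/* Models init_fpu(): may sleep, so it must run with preemption enabled. */
static int model_init_fpu(int fail)
{
	if (model_preempt_depth != 0)
		return -1;	/* would be a sleeping-while-atomic bug */
	return fail ? -12 : 0;	/* -12 models -ENOMEM */
}

int model_read_cpu_fpu(size_t len, int fail_init)
{
	void *buf = model_hbuf_get(len);
	int ret;

	if (!buf)
		return -12;

	/* Sleeping work first, before preemption is disabled. */
	ret = model_init_fpu(fail_init);
	if (ret < 0)
		goto out;

	model_preempt_depth++;	/* stands in for preempt_disable() */
	/* ... copy buf into the task's FPU state here ... */
	model_preempt_depth--;	/* stands in for preempt_enable() */
out:
	model_hbuf_put(buf);	/* released on every path, success or error */
	return ret;
}
```

With this shape, an `init_fpu` failure can neither leave preemption disabled nor leak the buffer, because the error path never enters the atomic section and always passes through `out`.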
> 
> Speaking of cr_hbuf_get... I'd prefer to see that "allocator" go away
> and its users converted to kmalloc/kfree (this is what I've done for
> the powerpc C/R code, btw).
> 
> Using the slab allocator would:
> 
> * make the code less obscure and easier to review
> * make the code more amenable to static analysis
> * gain the benefits of slab debugging at runtime
> 
> But I think this has been pointed out before.  If I understand the
> justification for cr_hbuf_get correctly, the allocations it services
> are somehow known to be bounded in size and nesting.  But even if that
> is the case, it's not much of a reason to avoid using kmalloc, is it?
> 
The reason I want these wrappers (as opposed to allocators) in place is to
allow an optimization that will reduce application downtime during checkpoint.
Since we freeze the container during checkpoint, the applications inside are
unresponsive. The idea is to minimize the downtime by buffering the checkpoint
data in the kernel while the applications are frozen, and defer the (slow)
write-back of the buffer until after the application is allowed to resume
execution. (Memory pages will be marked COW instead of a physical copy in the
kernel).
To that end, we'll need the wrapper to not only allocate memory, but also track
all the pieces together as a long buffer. Actual implementation details are
not important now, but having a wrapper in place is.
Consequently, although I prefer not to change the current implementation of
cr_hbuf_get/put(), if you find it really helpful to change to kmalloc/kfree
I won't stand in the way. However, I do insist that the wrappers remain.
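To make the intent concrete, here is a minimal userspace sketch of such a wrapper, assuming a simple linked list of chunks. It is not the `cr_hbuf_get()` implementation from the patchset, and all names (`ckpt_buf`, `ckpt_buf_get`, `ckpt_buf_flush`) are hypothetical. Pieces are allocated and filled while the container would be frozen; the slow write-out, modeled here as a copy into a destination buffer, walks the same pieces in order afterwards.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One piece of checkpoint data, chained in allocation order. */
struct ckpt_chunk {
	struct ckpt_chunk *next;
	size_t len;
	unsigned char data[];
};

/* Hypothetical wrapper state: the pieces together form one long buffer. */
struct ckpt_buf {
	struct ckpt_chunk *head, *tail;
	size_t total;
};

/* Allocate a piece and record it; the caller fills it while frozen. */
void *ckpt_buf_get(struct ckpt_buf *b, size_t len)
{
	struct ckpt_chunk *c = malloc(sizeof(*c) + len);

	if (!c)
		return NULL;
	c->next = NULL;
	c->len = len;
	if (b->tail)
		b->tail->next = c;
	else
		b->head = c;
	b->tail = c;
	b->total += len;
	return c->data;
}

/*
 * Deferred write-back: copy every piece, in order, into dst and free it.
 * dst stands in for the (slow) checkpoint file. Returns the number of
 * bytes copied, or 0 without consuming anything if dst is too small.
 */
size_t ckpt_buf_flush(struct ckpt_buf *b, unsigned char *dst, size_t cap)
{
	size_t off = 0;
	struct ckpt_chunk *c;

	if (cap < b->total)
		return 0;
	c = b->head;
	while (c) {
		struct ckpt_chunk *next = c->next;

		memcpy(dst + off, c->data, c->len);
		off += c->len;
		free(c);
		c = next;
	}
	assert(off == b->total);
	b->head = b->tail = NULL;
	b->total = 0;
	return off;
}
```

The callers only ever see get and flush, so the backing store could change from one chunk per call to a preallocated region without touching them, which is the argument for keeping the wrappers.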
Oren.
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-13 22:35                                     ` Oren Laadan
@ 2009-03-18 18:54                                       ` Mike Waychison
  2009-03-18 19:04                                         ` Oren Laadan
  0 siblings, 1 reply; 121+ messages in thread
From: Mike Waychison @ 2009-03-18 18:54 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Linus Torvalds, Andrew Morton, linux-api, containers, hpa,
	linux-kernel, Dave Hansen, linux-mm, viro, mingo, mpm, tglx,
	Sukadev Bhattiprolu, Alexey Dobriyan, xemul
Oren Laadan wrote:
> 
> Mike Waychison wrote:
>> Linus Torvalds wrote:
>>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>>
>>>> Ying Han [yinghan@google.com] wrote:
>>>> | Hi Serge:
>>>> | I made a patch based on Oren's tree recently which implement a new
>>>> | syscall clone_with_pid. I tested with checkpoint/restart process tree
>>>> | and it works as expected.
>>>>
>>>> Yes, I think we had a version of clone() with pid a while ago.
>>> Are people _at_all_ thinking about security?
>>>
>>> Obviously not.
>>>
>>> There's no way we can do anything like this. Sure, it's trivial to do 
>>> inside the kernel. But it also sounds like a _wonderful_ attack vector 
>>> against badly written user-land software that sends signals and has small 
>>> races.
>> I'm not really sure how this is different than a malicious app going off 
>> and spawning thousands of threads in an attempt to hit a target pid from 
>> a security pov.  Sure, it makes it easier, but it's not like there is 
>> anything in place to close the attack vector.
>>
>>> Quite frankly, from having followed the discussion(s) over the last few 
>>> weeks about checkpoint/restart in various forms, my reaction to just about 
>>> _all_ of this is that people pushing this are pretty damn borderline. 
>>>
>>> I think you guys are working on all the wrong problems. 
>>>
>>> Let's face it, we're not going to _ever_ checkpoint any kind of general 
>>> case process. Just TCP makes that fundamentally impossible in the general 
>>> case, and there are lots and lots of other cases too (just something as 
>>> totally _trivial_ as all the files in the filesystem that don't get rolled 
>>> back).
>> In some instances such as ours, TCP is probably the easiest thing to 
>> migrate.  In an rpc-based cluster application, TCP is nothing more than 
>> an RPC channel and applications already have to handle RPC channel 
>> failure and re-establishment.
>>
>> I agree that this is not the 'general case' as you mention above 
>> however.  This is the bit that sorta bothers me with the way the 
>> implementation has been going so far on this list.  The implementation 
>> that folks are building on top of Oren's patchset tries to be everything 
>> to everybody.  For our purposes, we need to have the flexibility of 
>> choosing *how* we checkpoint.  The line seems to be arbitrarily drawn at 
>> the kernel being responsible for checkpointing and restoring all 
>> resources associated with a task, and leaving userland with nothing more 
>> than transporting filesystem bits.  This approach isn't flexible enough: 
>>   Consider the case where we want to stub out most of the TCP file 
>> descriptors with ECONNRESETed sockets because we know that they are RPC 
>> sockets and can re-establish themselves, but we want to use some other 
>> mechanism for TCP sockets we don't know much about.  The current 
>> monolithic approach has zero flexibility for doing anything like this, 
>> and I can't figure out how we could even fit anything like this in.
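The "RPC sockets can re-establish themselves" assumption above can be
sketched as a retry loop around a call that fails with ECONNRESET, which is
what a stubbed-out socket would return after restart. This is only an
illustration: struct rpc_channel, its hooks, and the fake_* stubs are all
hypothetical, and a real client would wrap send()/recv() and connect():

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical transport hooks, invented for this sketch. */
struct rpc_channel {
	int (*call)(struct rpc_channel *ch);	  /* 0 on success, -1 with errno set */
	int (*reconnect)(struct rpc_channel *ch); /* 0 on success, -1 on failure */
	void *state;
};

/* Issue one RPC; on ECONNRESET, re-establish the channel and retry. */
static int rpc_call_with_retry(struct rpc_channel *ch, int max_retries)
{
	for (int attempt = 0; ; attempt++) {
		if (ch->call(ch) == 0)
			return 0;
		if (errno != ECONNRESET || attempt >= max_retries)
			return -1;
		if (ch->reconnect(ch) != 0)
			return -1;
	}
}

/* Demo stub: behaves like a socket that was reset across a restart. */
struct fake_state {
	int connected;
	int reconnects;
};

static int fake_call(struct rpc_channel *ch)
{
	struct fake_state *s = ch->state;

	if (!s->connected) {
		errno = ECONNRESET;
		return -1;
	}
	return 0;
}

static int fake_reconnect(struct rpc_channel *ch)
{
	struct fake_state *s = ch->state;

	s->connected = 1;
	s->reconnects++;
	return 0;
}
```

An application written this way does not need its TCP state preserved
across a checkpoint; it only needs the restored descriptor to report the
reset.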
> 
> The flexibility exists, but wasn't spelled out, so here it is:
> 
> 1) Similar to madvise(), I envision a cradvice() that could tell the c/r
> something about specific resources, e.g.:
>  * cradvice(CR_ADV_MEM, ptr, len)  -> don't save that memory, it's scratch
>  * cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET)  -> reset connection on restart
> etc. (never mind the exact interface right now)
> 
> 2) Tasks can ask to be notified (e.g. register a signal) when a checkpoint
> or a restart completes successfully. At that time they can do their private
> house-keeping if they know better.
> 
> 3) If restoring some resource is significantly easier in user space (e.g. a
> file-descriptor of some special device which user space knows how to
> re-initialize), then the restarting task can prepare it ahead of time
> and call:
>   * cradvice(CR_ADV_USERFD, fd, 0)  -> use the fd in place instead of trying
> 				       to restore it yourself.
This would be called by the embryo process (mktree.c?) before calling 
sys_restart?
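To make the ordering in that question concrete, here is a userspace mock of
the proposed call. cradvice() and the CR_ADV_* names come from the proposal
quoted above, but no such interface exists; the numeric values, the argument
types, the advice table and mock_fd_is_user_supplied() are all invented for
illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Names from the proposal; the numeric values are invented. */
enum { CR_ADV_MEM = 1, CR_ADV_SOCK, CR_ADV_USERFD };
enum { CR_ADV_SOCK_RESET = 1 };

#define MOCK_MAX_ADVICE 16

struct mock_advice {
	int kind;
	unsigned long arg1, arg2;
};

static struct mock_advice mock_table[MOCK_MAX_ADVICE];
static size_t mock_count;

/*
 * Mock only: record the advice so a later restart step can consult it.
 * Returns 0 on success, -1 for an unknown kind or a full table.
 */
static int cradvice(int kind, unsigned long arg1, unsigned long arg2)
{
	if (kind < CR_ADV_MEM || kind > CR_ADV_USERFD)
		return -1;
	if (mock_count == MOCK_MAX_ADVICE)
		return -1;
	mock_table[mock_count++] = (struct mock_advice){ kind, arg1, arg2 };
	return 0;
}

/* What a restart path might ask per fd: 1 if userspace already set it up. */
static int mock_fd_is_user_supplied(int fd)
{
	for (size_t i = 0; i < mock_count; i++)
		if (mock_table[i].kind == CR_ADV_USERFD &&
		    mock_table[i].arg1 == (unsigned long)fd)
			return 1;
	return 0;
}
```

In this reading, the embryo process would open and initialize the special
device itself, call cradvice(CR_ADV_USERFD, fd, 0), and only then invoke
the restart, which would skip restoring any fd marked this way.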
> 
> Method #3 is what I used in Zap to implement distributed checkpoints, where
> it is so much easier to recreate all network connections in user space than
> putting that logic into the kernel.
> 
> Now, on the other hand, doing the c/r from userland is much less flexible
> than in the kernel (e.g. epollfd, futex state and much more) and requires
> exposing a tremendous amount of in-kernel data to user space. And we all know
> that exposing internals is always a one-way ticket :(
> 
> [...]
> 
> Oren.
> 
> 
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-18 18:54                                       ` Mike Waychison
@ 2009-03-18 19:04                                         ` Oren Laadan
  0 siblings, 0 replies; 121+ messages in thread
From: Oren Laadan @ 2009-03-18 19:04 UTC (permalink / raw)
  To: Mike Waychison
  Cc: Linus Torvalds, Andrew Morton, linux-api, containers, hpa,
	linux-kernel, Dave Hansen, linux-mm, viro, mingo, mpm, tglx,
	Sukadev Bhattiprolu, Alexey Dobriyan, xemul
Mike Waychison wrote:
> Oren Laadan wrote:
>>
>> Mike Waychison wrote:
>>> Linus Torvalds wrote:
>>>> On Thu, 12 Mar 2009, Sukadev Bhattiprolu wrote:
>>>>
>>>>> Ying Han [yinghan@google.com] wrote:
>>>>> | Hi Serge:
>>>>> | I made a patch based on Oren's tree recently which implements a new
>>>>> | syscall clone_with_pid. I tested with checkpoint/restart process tree
>>>>> | and it works as expected.
>>>>>
>>>>> Yes, I think we had a version of clone() with pid a while ago.
>>>> Are people _at_all_ thinking about security?
>>>>
>>>> Obviously not.
>>>>
>>>> There's no way we can do anything like this. Sure, it's trivial to
>>>> do inside the kernel. But it also sounds like a _wonderful_ attack
>>>> vector against badly written user-land software that sends signals
>>>> and has small races.
>>> I'm not really sure how this is different than a malicious app going
>>> off and spawning thousands of threads in an attempt to hit a target
>>> pid from a security pov.  Sure, it makes it easier, but it's not like
>>> there is anything in place to close the attack vector.
>>>
>>>> Quite frankly, from having followed the discussion(s) over the last
>>>> few weeks about checkpoint/restart in various forms, my reaction to
>>>> just about _all_ of this is that people pushing this are pretty damn
>>>> borderline.
>>>> I think you guys are working on all the wrong problems.
>>>> Let's face it, we're not going to _ever_ checkpoint any kind of
>>>> general case process. Just TCP makes that fundamentally impossible
>>>> in the general case, and there are lots and lots of other cases too
>>>> (just something as totally _trivial_ as all the files in the
>>>> filesystem that don't get rolled back).
>>> In some instances such as ours, TCP is probably the easiest thing to
>>> migrate.  In an rpc-based cluster application, TCP is nothing more
>>> than an RPC channel and applications already have to handle RPC
>>> channel failure and re-establishment.
>>>
>>> I agree that this is not the 'general case' as you mention above
>>> however.  This is the bit that sorta bothers me with the way the
>>> implementation has been going so far on this list.  The
>>> implementation that folks are building on top of Oren's patchset
>>> tries to be everything to everybody.  For our purposes, we need to
>>> have the flexibility of choosing *how* we checkpoint.  The line seems
>>> to be arbitrarily drawn at the kernel being responsible for
>>> checkpointing and restoring all resources associated with a task, and
>>> leaving userland with nothing more than transporting filesystem
>>> bits.  This approach isn't flexible enough:   Consider the case where
>>> we want to stub out most of the TCP file descriptors with
>>> ECONNRESETed sockets because we know that they are RPC sockets and
>>> can re-establish themselves, but we want to use some other mechanism
>>> for TCP sockets we don't know much about.  The current monolithic
>>> approach has zero flexibility for doing anything like this, and I
>>> can't figure out how we could even fit anything like this in.
>>
>> The flexibility exists, but wasn't spelled out, so here it is:
>>
>> 1) Similar to madvise(), I envision a cradvice() that could tell the c/r
>> something about specific resources, e.g.:
>>  * cradvice(CR_ADV_MEM, ptr, len)  -> don't save that memory, it's scratch
>>  * cradvice(CR_ADV_SOCK, fd, CR_ADV_SOCK_RESET)  -> reset connection on restart
>> etc. (never mind the exact interface right now)
>>
>> 2) Tasks can ask to be notified (e.g. register a signal) when a checkpoint
>> or a restart completes successfully. At that time they can do their private
>> house-keeping if they know better.
>>
>> 3) If restoring some resource is significantly easier in user space (e.g. a
>> file-descriptor of some special device which user space knows how to
>> re-initialize), then the restarting task can prepare it ahead of time
>> and call:
>>   * cradvice(CR_ADV_USERFD, fd, 0)  -> use the fd in place instead of trying
>>                                        to restore it yourself.
> 
> This would be called by the embryo process (mktree.c?) before calling
> sys_restart?
Yes.
> 
>>
>> Method #3 is what I used in Zap to implement distributed checkpoints, where
>> it is so much easier to recreate all network connections in user space than
>> putting that logic into the kernel.
>>
>> Now, on the other hand, doing the c/r from userland is much less flexible
>> than in the kernel (e.g. epollfd, futex state and much more) and requires
>> exposing a tremendous amount of in-kernel data to user space. And we all know
>> that exposing internals is always a one-way ticket :(
>>
>> [...]
>>
>> Oren.
>>
>>
> 
> 
* Re: How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
  2009-03-14  8:12                                               ` Ingo Molnar
  2009-03-16 22:33                                                 ` Kevin Fox
@ 2009-03-19 21:19                                                 ` Eric W. Biederman
  1 sibling, 0 replies; 121+ messages in thread
From: Eric W. Biederman @ 2009-03-19 21:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Oren Laadan, Dave Hansen, linux-api, containers, hpa,
	linux-kernel, Alexey Dobriyan, linux-mm, viro, mpm, Andrew Morton,
	Sukadev Bhattiprolu, Linus Torvalds, tglx, xemul
Ingo Molnar <mingo@elte.hu> writes:
> * Eric W. Biederman <ebiederm@xmission.com> wrote:
>
>> >> In the OpenVZ case, they've at least demonstrated that the 
>> >> filesystem can be moved largely with rsync.  Unlinked files 
>> >> need some in-kernel TLC (or /proc mangling) but it isn't 
>> >> *that* bad.
>> >
>> > And in the Zap we have successfully used a log-based 
>> > filesystem (specifically NILFS) to continuously snapshot the 
>> > file-system atomically with taking a checkpoint, so it can 
>> > easily branch off past checkpoints, including the file 
>> > system.
>> >
>> > And unlinked files can be (inefficiently) handled by saving 
>> > their full contents with the checkpoint image - it's not a 
>> > big toll on many apps (if you exclude Wine and UML...). At 
>> > least that's a start.
>> 
>> Oren, we might want to do a proof of concept implementation 
>> like I did with network namespaces.  That is done in the 
>> community and goes far enough to show we don't have horribly 
>> nasty code.  The patches and individual changes don't need to 
>> be quite perfect but close enough that they can be considered 
>> for merging.
>> 
>> For the network namespace that seems to have made a big 
>> difference.
>> 
>> I'm afraid in our clean start we may have focused a little too 
>> much on merging something simple and not gone far enough on 
>> showing that things will work.
>> 
>> After I had that in the network namespace and we had a clear 
>> vision of the direction, we started merging the individual 
>> patches and things went well.
>
> I'm curious: what is the actual end result other than good 
> looking code? In terms of tangible benefits to the everyday 
> Linux distro user. [This is not meant to be sarcastic, i'm
> truly curious.]
Of the network namespace?  Sorry, I'm not certain what you are asking.
Eric
Thread overview: 121+ messages
2009-01-27 17:07 [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Oren Laadan
2009-01-27 17:07 ` [RFC v13][PATCH 01/14] Create syscalls: sys_checkpoint, sys_restart Oren Laadan
2009-01-27 17:20   ` Randy Dunlap
2009-01-27 17:08 ` [RFC v13][PATCH 02/14] Checkpoint/restart: initial documentation Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 03/14] Make file_pos_read/write() public Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 04/14] General infrastructure for checkpoint restart Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 05/14] x86 support for checkpoint/restart Oren Laadan
2009-02-24  7:47   ` Nathan Lynch
     [not found]     ` <20090224014739.1b82fc35-4v5LP+xe+1byhTdZtsIeww@public.gmane.org>
2009-02-24 16:06       ` Dave Hansen
2009-03-18  7:21     ` Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 06/14] Dump memory address space Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 07/14] Restore " Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 08/14] Infrastructure for shared objects Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 09/14] Dump open file descriptors Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 11/14] External checkpoint of a task other than ourself Oren Laadan
2009-01-27 17:08 ` [RFC v13][PATCH 13/14] Checkpoint multiple processes Oren Laadan
     [not found] ` <1233076092-8660-1-git-send-email-orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-01-27 17:08   ` [RFC v13][PATCH 10/14] Restore open file descriprtors Oren Laadan
2009-01-27 17:08   ` [RFC v13][PATCH 12/14] Track in-kernel when we expect checkpoint/restart to work Oren Laadan
2009-01-27 17:08   ` [RFC v13][PATCH 14/14] Restart multiple processes Oren Laadan
2009-02-10 17:05 ` [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Dave Hansen
2009-02-11 22:14   ` Andrew Morton
2009-02-12  9:17     ` Ingo Molnar
     [not found]       ` <20090212091721.GB1888-X9Un+BFzKDI@public.gmane.org>
2009-02-12 18:11         ` Dave Hansen
2009-02-12 20:48           ` Serge E. Hallyn
2009-02-13 10:20           ` Ingo Molnar
2009-02-12 18:11     ` Dave Hansen
2009-02-12 19:30       ` Matt Mackall
2009-02-12 19:42         ` Andrew Morton
2009-02-12 21:51           ` What can OpenVZ do? Dave Hansen
2009-02-12 22:10             ` Andrew Morton
2009-02-12 23:04               ` How much of a mess does OpenVZ make? ;) Was: " Dave Hansen
2009-02-26 15:57                 ` Alexey Dobriyan
2009-03-10 21:53                   ` Alexey Dobriyan
2009-03-10 23:28                     ` Serge E. Hallyn
2009-03-11  8:26                     ` Cedric Le Goater
2009-03-12 14:53                       ` Serge E. Hallyn
2009-03-12 21:01                         ` Greg Kurz
2009-03-12 21:21                           ` Serge E. Hallyn
2009-03-13  4:29                             ` Ying Han
2009-03-13  5:34                               ` Sukadev Bhattiprolu
     [not found]                                 ` <20090313053458.GA28833-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-03-13  6:19                                   ` Ying Han
2009-03-13 17:27                                 ` Linus Torvalds
2009-03-13 19:02                                   ` Serge E. Hallyn
     [not found]                                   ` <alpine.LFD.2.00.0903131018390.3940-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2009-03-13 19:35                                     ` Alexey Dobriyan
2009-03-13 21:01                                       ` Linus Torvalds
2009-03-13 21:51                                         ` Dave Hansen
2009-03-13 22:15                                           ` Oren Laadan
2009-03-14  0:27                                             ` Eric W. Biederman
2009-03-14  8:12                                               ` Ingo Molnar
2009-03-16 22:33                                                 ` Kevin Fox
2009-03-19 21:19                                                 ` Eric W. Biederman
     [not found]                                         ` <alpine.LFD.2.00.0903131401070.3940-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2009-03-14  0:20                                           ` Alexey Dobriyan
2009-03-14  8:25                                             ` Ingo Molnar
     [not found]                                               ` <20090314082532.GB16436-X9Un+BFzKDI@public.gmane.org>
2009-03-14 17:11                                                 ` Joseph Ruscio
2009-03-16  6:01                                               ` Oren Laadan
2009-03-13 20:48                                   ` Mike Waychison
2009-03-13 22:35                                     ` Oren Laadan
2009-03-18 18:54                                       ` Mike Waychison
2009-03-18 19:04                                         ` Oren Laadan
     [not found]                               ` <604427e00903122129y37ad791aq5fe7ef2552415da9-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-03-13 15:27                                 ` Cedric Le Goater
     [not found]                                   ` <49BA7B60.60607-GANU6spQydw@public.gmane.org>
2009-03-13 17:11                                     ` Greg Kurz
2009-03-13 17:37                               ` Serge E. Hallyn
2009-03-13 15:47                         ` Cedric Le Goater
2009-03-13 16:35                           ` Serge E. Hallyn
2009-03-13 16:53                             ` Cedric Le Goater
2009-02-26 16:27                 ` Alexey Dobriyan
2009-02-26 17:33                   ` Ingo Molnar
     [not found]                     ` <20090226173302.GB29439-X9Un+BFzKDI@public.gmane.org>
2009-02-26 18:30                       ` Greg Kurz
2009-02-26 22:17                         ` Alexey Dobriyan
     [not found]                           ` <20090226221709.GA2924-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
2009-02-27  9:19                             ` Greg Kurz
2009-02-27 10:53                               ` Alexey Dobriyan
2009-02-27 14:33                                 ` Cedric Le Goater
2009-02-27  9:36                           ` Cedric Le Goater
2009-02-26 22:31                       ` Alexey Dobriyan
2009-02-27  9:03                         ` Ingo Molnar
2009-02-27  9:19                           ` Andrew Morton
2009-02-27 10:57                             ` Alexey Dobriyan
     [not found]                           ` <20090227090323.GC16211-X9Un+BFzKDI@public.gmane.org>
2009-02-27  9:22                             ` Andrew Morton
2009-02-27 10:59                               ` Alexey Dobriyan
2009-02-27 16:14                         ` Dave Hansen
2009-02-27 21:57                           ` Alexey Dobriyan
     [not found]                             ` <20090227215749.GA3453-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
2009-02-27 21:54                               ` Dave Hansen
     [not found]                         ` <20090226223112.GA2939-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
2009-03-01  1:33                           ` Alexey Dobriyan
     [not found]                             ` <20090301013304.GA2428-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
2009-03-01 20:02                               ` Serge E. Hallyn
     [not found]                                 ` <20090301200231.GA25276-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-03-01 20:56                                   ` Alexey Dobriyan
2009-03-01 22:21                                     ` Serge E. Hallyn
2009-03-03 16:17                                     ` Cedric Le Goater
2009-03-03 18:28                                       ` Serge E. Hallyn
2009-02-13 10:53               ` Ingo Molnar
     [not found]                 ` <20090213105302.GC4608-X9Un+BFzKDI@public.gmane.org>
2009-02-16 20:51                   ` Dave Hansen
2009-02-17 22:23                     ` Ingo Molnar
     [not found]                       ` <20090217222319.GA10546-X9Un+BFzKDI@public.gmane.org>
2009-02-17 22:30                         ` Dave Hansen
2009-02-18  0:32                           ` Ingo Molnar
2009-02-18  0:40                             ` Dave Hansen
2009-02-18  5:11                               ` Alexey Dobriyan
2009-02-18 18:16                                 ` Ingo Molnar
     [not found]                                   ` <20090218181644.GD19995-X9Un+BFzKDI@public.gmane.org>
2009-02-18 21:27                                     ` Dave Hansen
2009-02-18 23:15                                       ` Ingo Molnar
2009-02-19 19:06                                         ` Banning checkpoint (was: Re: What can OpenVZ do?) Alexey Dobriyan
2009-02-19 19:11                                           ` Dave Hansen
2009-02-24  4:47                                             ` Alexey Dobriyan
     [not found]                                               ` <20090224044752.GB3202-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
2009-02-24  5:11                                                 ` Dave Hansen
2009-02-24 15:43                                                   ` Serge E. Hallyn
2009-02-24 20:09                                                   ` Alexey Dobriyan
2009-02-12 22:17             ` What can OpenVZ do? Alexey Dobriyan
2009-02-13 10:27             ` Ingo Molnar
2009-02-13 11:32               ` Alexey Dobriyan
2009-02-13 11:45                 ` Ingo Molnar
2009-02-13 22:28                   ` Alexey Dobriyan
2009-03-14  0:04                     ` Eric W. Biederman
2009-03-14  0:26                       ` Serge E. Hallyn
2009-02-12 22:57         ` [RFC v13][PATCH 00/14] Kernel based checkpoint/restart Dave Hansen
2009-02-12 23:05           ` Matt Mackall
2009-02-12 23:13             ` Dave Hansen
2009-02-13 23:28       ` Andrew Morton
2009-02-14 23:08         ` Ingo Molnar
2009-02-14 23:31           ` Andrew Morton
2009-02-14 23:50             ` Ingo Molnar
     [not found]         ` <20090213152836.0fbbfa7d.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2009-02-16 17:37           ` Dave Hansen
2009-03-13  2:45         ` Oren Laadan
2009-03-13  3:57           ` Oren Laadan