[patch 0/5] [RFC] Checkpoint/restore and Elf extension

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [patch 0/5] [RFC] Checkpoint/restore and Elf extension
@ 2011-10-14 11:04 Cyrill Gorcunov
  2011-10-14 11:04 ` [patch 1/5] proc: Introduce the Children: line in /proc/<pid>/status Cyrill Gorcunov
                   ` (4 more replies)
  0 siblings, 5 replies; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-14 11:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Vagin, Pavel Emelyanov, James Bottomley, Glauber Costa,
	H. Peter Anvin, Ingo Molnar, Tejun Heo, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan

Hi all,

it's been known that we're working on checkpoint/restore capability
and the initial RFC series might be found at

      http://permalink.gmane.org/gmane.linux.kernel.containers/

so here is a second attempt to bring some pretty raw and early checkpoint
prototype into discussion.

The first general change from the pervious RFC series is that we're switched to
brand new ptrace-seize facility kernel provides, in a sake of dumping task
memory.

There are still known problems with SEIZE not working with frozen tasks, but
we're working on it.

The second one -- the restore procedure looks like a simple exec() call
from user-space view. Of course this requires some help from kernel side
so we implemented an addition/extension to traditional Elf format.

The main purpose of this set is to discuss whether this "execve the image
in ELF format" approach is suitable for the community.

User space 'crtools' tool can be found here (please checkout at lkml tag
since this is a point when this patch series was sent)

      git://github.com/cyrillos/crtools.git

While the testing was done on top of Linux 3.1-rc3, some patches required
are already included into -mm bundle. So I don't send those patches here,
still they all might be found in crtools source code tree under kernel/
directory (see readme there as well). It includes quilt 'series' file and the all patches
needed in 'all-in-one-series' fashion. The patch 'clone-with-pid-support'
was sent separately for review procedure, so I don't include it into this
patchset as well. And finally the patches for /proc/$pid/map_files are not
included into this bundle too since they are already reviewed and I suppose
will be merged after 3.1, still you can find them under crtools/kernel/
directory.

So to test the features one needs to apply all patches from kernel/ directory,
build the kernel, crtools and do

	crtools -d -t $pid

where the $pid is a pid of process. To restore checkpointed process one need
to run

	crtools -r -t $pid

To see sumped contents one need to type

	crtools -s -t $pid

Note the patch series and crtools itself is rather a very early prototype whith
known problems...

I run tests from test/ directory like (under kvm with CentOS 5.5 on a board)

$ make test
$ test/testee &
[2484]
$ ./crtools -d -t 2484
$ killall -9 testee
$ ./crtools -r -t 2484
    ... here some pings from testee should appear ...

So *ANY* comments, complains, ideas are HIGHLY APPRECIATED!

	Cyrill

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [patch 1/5] proc: Introduce the Children: line in /proc/<pid>/status
  2011-10-14 11:04 [patch 0/5] [RFC] Checkpoint/restore and Elf extension Cyrill Gorcunov
@ 2011-10-14 11:04 ` Cyrill Gorcunov
  2011-10-14 16:36   ` Tejun Heo
  2011-10-14 11:04 ` [patch 2/5] fs: Add do_close helper Cyrill Gorcunov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-14 11:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Vagin, Pavel Emelyanov, James Bottomley, Glauber Costa,
	H. Peter Anvin, Ingo Molnar, Tejun Heo, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan, Serge Hallyn,
	Cyrill Gorcunov

[-- Attachment #1: cr-proc-add-children --]
[-- Type: text/plain, Size: 1374 bytes --]

From: Pavel Emelyanov <xemul@parallels.com>

Although we can get the pids of some task's issue, this is just 
more convenient to have them this way.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 fs/proc/array.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -158,6 +158,18 @@ static inline const char *get_task_state
 	return *p;
 }
 
+static void task_children(struct seq_file *m, struct task_struct *p, struct pid_namespace *ns)
+{
+	struct task_struct *c;
+
+	seq_printf(m, "Children:");
+	read_lock(&tasklist_lock);
+	list_for_each_entry(c, &p->children, sibling)
+		seq_printf(m, " %d", pid_nr_ns(task_pid(c), ns));
+	read_unlock(&tasklist_lock);
+	seq_putc(m, '\n');
+}
+
 static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
 				struct pid *pid, struct task_struct *p)
 {
@@ -192,6 +204,8 @@ static inline void task_state(struct seq
 		cred->uid, cred->euid, cred->suid, cred->fsuid,
 		cred->gid, cred->egid, cred->sgid, cred->fsgid);
 
+	task_children(m, p, ns);
+
 	task_lock(p);
 	if (p->files)
 		fdt = files_fdtable(p->files);


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [patch 2/5] fs: Add do_close helper
  2011-10-14 11:04 [patch 0/5] [RFC] Checkpoint/restore and Elf extension Cyrill Gorcunov
  2011-10-14 11:04 ` [patch 1/5] proc: Introduce the Children: line in /proc/<pid>/status Cyrill Gorcunov
@ 2011-10-14 11:04 ` Cyrill Gorcunov
  2011-10-14 11:04 ` [patch 3/5] fs, proc: Add /proc/$pid/tls entry Cyrill Gorcunov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-14 11:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Vagin, Pavel Emelyanov, James Bottomley, Glauber Costa,
	H. Peter Anvin, Ingo Molnar, Tejun Heo, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Cyrill Gorcunov

[-- Attachment #1: fs-add-do-close --]
[-- Type: text/plain, Size: 2480 bytes --]

To be able to close file descriptors right from inside
kernel space do_close() helper is added.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 fs/open.c          |   32 ++++++++++++++++++++------------
 include/linux/fs.h |    1 +
 2 files changed, 21 insertions(+), 12 deletions(-)

Index: linux-2.6.git/fs/open.c
===================================================================
--- linux-2.6.git.orig/fs/open.c
+++ linux-2.6.git/fs/open.c
@@ -1056,17 +1056,11 @@ int filp_close(struct file *filp, fl_own
 
 EXPORT_SYMBOL(filp_close);
 
-/*
- * Careful here! We test whether the file pointer is NULL before
- * releasing the fd. This ensures that one clone task can't release
- * an fd while another clone is opening it.
- */
-SYSCALL_DEFINE1(close, unsigned int, fd)
+int do_close(unsigned int fd)
 {
 	struct file * filp;
 	struct files_struct *files = current->files;
 	struct fdtable *fdt;
-	int retval;
 
 	spin_lock(&files->file_lock);
 	fdt = files_fdtable(files);
@@ -1079,7 +1073,25 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 	FD_CLR(fd, fdt->close_on_exec);
 	__put_unused_fd(files, fd);
 	spin_unlock(&files->file_lock);
-	retval = filp_close(filp, files);
+
+	return filp_close(filp, files);
+
+out_unlock:
+	spin_unlock(&files->file_lock);
+	return -EBADF;
+}
+EXPORT_SYMBOL_GPL(do_close);
+
+/*
+ * Careful here! We test whether the file pointer is NULL before
+ * releasing the fd. This ensures that one clone task can't release
+ * an fd while another clone is opening it.
+ */
+SYSCALL_DEFINE1(close, unsigned int, fd)
+{
+	int retval;
+
+	retval = do_close(fd);
 
 	/* can't restart close syscall because file table entry was cleared */
 	if (unlikely(retval == -ERESTARTSYS ||
@@ -1089,10 +1101,6 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 		retval = -EINTR;
 
 	return retval;
-
-out_unlock:
-	spin_unlock(&files->file_lock);
-	return -EBADF;
 }
 EXPORT_SYMBOL(sys_close);
 
Index: linux-2.6.git/include/linux/fs.h
===================================================================
--- linux-2.6.git.orig/include/linux/fs.h
+++ linux-2.6.git/include/linux/fs.h
@@ -2025,6 +2025,7 @@ extern struct file *file_open_root(struc
 extern struct file * dentry_open(struct dentry *, struct vfsmount *, int,
 				 const struct cred *);
 extern int filp_close(struct file *, fl_owner_t id);
+extern int do_close(unsigned int fd);
 extern char * getname(const char __user *);
 
 /* fs/ioctl.c */


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [patch 3/5] fs, proc: Add /proc/$pid/tls entry
  2011-10-14 11:04 [patch 0/5] [RFC] Checkpoint/restore and Elf extension Cyrill Gorcunov
  2011-10-14 11:04 ` [patch 1/5] proc: Introduce the Children: line in /proc/<pid>/status Cyrill Gorcunov
  2011-10-14 11:04 ` [patch 2/5] fs: Add do_close helper Cyrill Gorcunov
@ 2011-10-14 11:04 ` Cyrill Gorcunov
  2011-10-14 16:40   ` Tejun Heo
  2011-10-14 11:04 ` [patch 4/5] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat Cyrill Gorcunov
  2011-10-14 11:04 ` [patch 5/5] elf: Add support for loading ET_CKPT files Cyrill Gorcunov
  4 siblings, 1 reply; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-14 11:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Vagin, Pavel Emelyanov, James Bottomley, Glauber Costa,
	H. Peter Anvin, Ingo Molnar, Tejun Heo, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Cyrill Gorcunov

[-- Attachment #1: fs-proc-add-tls --]
[-- Type: text/plain, Size: 1372 bytes --]

To be able to restart checkpointed tasks we need
to know TLS status at dumping time. Export this
information by /proc/$pid/tls entry.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 fs/proc/base.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Index: linux-2.6.git/fs/proc/base.c
===================================================================
--- linux-2.6.git.orig/fs/proc/base.c
+++ linux-2.6.git/fs/proc/base.c
@@ -3150,6 +3150,23 @@ static int proc_pid_personality(struct s
 	return err;
 }
 
+#ifdef CONFIG_X86
+static int proc_pid_tls(struct seq_file *m, struct pid_namespace *ns,
+			struct pid *pid, struct task_struct *task)
+{
+	int err = lock_trace(task);
+	if (!err) {
+		int i;
+		for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
+			seq_printf(m, "%x %x\n",
+				   task->thread.tls_array[i].a,
+				   task->thread.tls_array[i].b);
+		unlock_trace(task);
+	}
+	return err;
+}
+#endif
+
 /*
  * Thread groups
  */
@@ -3169,6 +3186,9 @@ static const struct pid_entry tgid_base_
 	INF("auxv",       S_IRUSR, proc_pid_auxv),
 	ONE("status",     S_IRUGO, proc_pid_status),
 	ONE("personality", S_IRUGO, proc_pid_personality),
+#ifdef CONFIG_X86
+	ONE("tls",        S_IRUGO, proc_pid_tls),
+#endif
 	INF("limits",	  S_IRUGO, proc_pid_limits),
 #ifdef CONFIG_SCHED_DEBUG
 	REG("sched",      S_IRUGO|S_IWUSR, proc_pid_sched_operations),


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [patch 4/5] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat
  2011-10-14 11:04 [patch 0/5] [RFC] Checkpoint/restore and Elf extension Cyrill Gorcunov
                   ` (2 preceding siblings ...)
  2011-10-14 11:04 ` [patch 3/5] fs, proc: Add /proc/$pid/tls entry Cyrill Gorcunov
@ 2011-10-14 11:04 ` Cyrill Gorcunov
  2011-10-14 11:04 ` [patch 5/5] elf: Add support for loading ET_CKPT files Cyrill Gorcunov
  4 siblings, 0 replies; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-14 11:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Vagin, Pavel Emelyanov, James Bottomley, Glauber Costa,
	H. Peter Anvin, Ingo Molnar, Tejun Heo, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Cyrill Gorcunov

[-- Attachment #1: fs-proc-add-mm-task-stat --]
[-- Type: text/plain, Size: 1165 bytes --]

It helps to dump and restore this mm_struct members at chekpoint/restore time.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
---
 fs/proc/array.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6.git/fs/proc/array.c
===================================================================
--- linux-2.6.git.orig/fs/proc/array.c
+++ linux-2.6.git/fs/proc/array.c
@@ -478,7 +478,7 @@ static int do_task_stat(struct seq_file 
 
 	seq_printf(m, "%d (%s) %c %d %d %d %d %d %u %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu %lu %ld %lu %lu %lu\n",
 		pid_nr_ns(pid, ns),
 		tcomm,
 		state,
@@ -525,7 +525,10 @@ static int do_task_stat(struct seq_file 
 		task->policy,
 		(unsigned long long)delayacct_blkio_ticks(task),
 		cputime_to_clock_t(gtime),
-		cputime_to_clock_t(cgtime));
+		cputime_to_clock_t(cgtime),
+		mm ? (permitted ? mm->start_data : 1) : 0,
+		mm ? (permitted ? mm->end_data : 1) : 0,
+		mm ? (permitted ? mm->start_brk : 1) : 0);
 	if (mm)
 		mmput(mm);
 	return 0;


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-14 11:04 [patch 0/5] [RFC] Checkpoint/restore and Elf extension Cyrill Gorcunov
                   ` (3 preceding siblings ...)
  2011-10-14 11:04 ` [patch 4/5] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat Cyrill Gorcunov
@ 2011-10-14 11:04 ` Cyrill Gorcunov
  2011-10-14 17:10   ` Tejun Heo
  4 siblings, 1 reply; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-14 11:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Vagin, Pavel Emelyanov, James Bottomley, Glauber Costa,
	H. Peter Anvin, Ingo Molnar, Tejun Heo, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Cyrill Gorcunov

[-- Attachment #1: binfmt-elf-for-cr-5 --]
[-- Type: text/plain, Size: 26381 bytes --]

This patch add ability to run that named "checkpoint" files by
enhancing Elf file format, which includes

 - new Elf file type ET_CKPT

 - three additional program header types PT_CKPT_VMA, PT_CKPT_CORE
   and PT_CKPT_PAGES.

     PT_CKPT_VMA -- holds 'vma_entry' structure, which describes the
     memory area the kernel should map. It also might contain a file descriptor
     so the kernel will be mapping a file povided. Usually such file get
     opened by user-space helper tool which prepares 'vma_entry' structure
     for the kernel.

     PT_CKPT_CORE -- 'core_entry' structure (registers, tls, tasks specific
     settings). The structure is defined as a 16K container which should be
     enough for most cases. 8K of it is reserved for arch specific settings.

     PT_CKPT_PAGES -- a set of all pages which contents we should restored.

Apart from Elf extension flush_old_exec() has been splitted to two
functions -- the former flush_old_exec() and flush_exec_keep_thread().
The later doesn't call for de_thread() allowing to keep threads
relationship. Also arch_setup_additional_pages_at() helper added
to setup vdso at predefined address.

At moment only pure x86-64 architecture is supported.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Andrew Vagin <avagin@parallels.com>
CC: Pavel Emelyanov <xemul@parallels.com>
CC: James Bottomley <jbottomley@parallels.com>
CC: Glauber Costa <glommer@parallels.com>
CC: H. Peter Anvin <hpa@zytor.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Tejun Heo <tj@kernel.org>
CC: Dave Hansen <dave@linux.vnet.ibm.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: Daniel Lezcano <dlezcano@fr.ibm.com>
CC: Alexey Dobriyan <adobriyan@gmail.com>
---
 arch/x86/include/asm/elf.h      |    3 
 arch/x86/include/asm/elf_ckpt.h |   80 ++++++++
 arch/x86/kernel/Makefile        |    2 
 arch/x86/kernel/elf_ckpt.c      |  161 ++++++++++++++++++
 arch/x86/vdso/vma.c             |   22 ++
 fs/Kconfig.binfmt               |   11 +
 fs/Makefile                     |    1 
 fs/binfmt_elf.c                 |   17 +
 fs/binfmt_elf_ckpt.c            |  356 ++++++++++++++++++++++++++++++++++++++++
 fs/exec.c                       |   27 +--
 include/linux/binfmts.h         |    1 
 include/linux/elf_ckpt.h        |  103 +++++++++++
 12 files changed, 772 insertions(+), 12 deletions(-)

Index: linux-2.6.git/arch/x86/include/asm/elf.h
===================================================================
--- linux-2.6.git.orig/arch/x86/include/asm/elf.h
+++ linux-2.6.git/arch/x86/include/asm/elf.h
@@ -314,7 +314,8 @@ struct linux_binprm;
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
 				       int uses_interp);
-
+extern int arch_setup_additional_pages_at(struct linux_binprm *bprm,
+					  void *addr, int uses_interp);
 extern int syscall32_setup_pages(struct linux_binprm *, int exstack);
 #define compat_arch_setup_additional_pages	syscall32_setup_pages
 
Index: linux-2.6.git/arch/x86/include/asm/elf_ckpt.h
===================================================================
--- /dev/null
+++ linux-2.6.git/arch/x86/include/asm/elf_ckpt.h
@@ -0,0 +1,80 @@
+#ifndef _LINUX_ELF_X86_CHECKPOINT_H
+#define _LINUX_ELF_X86_CHECKPOINT_H
+
+#include <linux/errno.h>
+
+#include <asm/types.h>
+#include <asm/ptrace.h>
+
+#define CKPT_GDT_ENTRY_TLS_ENTRIES	3
+
+struct user_regs_entry {
+	__u64	r15;
+	__u64	r14;
+	__u64	r13;
+	__u64	r12;
+	__u64	bp;
+	__u64	bx;
+	__u64	r11;
+	__u64	r10;
+	__u64	r9;
+	__u64	r8;
+	__u64	ax;
+	__u64	cx;
+	__u64	dx;
+	__u64	si;
+	__u64	di;
+	__u64	orig_ax;
+	__u64	ip;
+	__u64	cs;
+	__u64	flags;
+	__u64	sp;
+	__u64	ss;
+	__u64	fs_base;
+	__u64	gs_base;
+	__u64	ds;
+	__u64	es;
+	__u64	fs;
+	__u64	gs;
+} __packed;
+
+struct desc_struct_entry {
+	__u32	a;
+	__u32	b;
+} __packed;
+
+struct user_fpregs_entry {
+	__u16	cwd;
+	__u16	swd;
+	__u16	twd;
+	__u16	fop;
+	__u64	rip;
+	__u64	rdp;
+	__u32	mxcsr;
+	__u32	mxcsr_mask;
+	__u32	st_space[32];
+	__u32	xmm_space[64];
+	__u32	padding[24];
+} __packed;
+
+struct ckpt_arch_entry {
+	struct user_regs_entry		gpregs;
+	struct user_fpregs_entry	fpregs;
+	struct desc_struct		tls_array[CKPT_GDT_ENTRY_TLS_ENTRIES];
+};
+
+struct core_entry;
+
+#ifdef CONFIG_X86_64
+extern int load_elf_ckpt_arch(struct task_struct *tsk, struct pt_regs *regs,
+			      struct core_entry *core_entry);
+#else
+static inline int
+load_elf_ckpt_arch(struct task_struct *tsk, struct pt_regs *regs,
+		   struct core_entry *core_entry)
+{
+	return -ENOEXEC;
+}
+#endif
+
+#endif /* _LINUX_ELF_X86_CHECKPOINT_H */
Index: linux-2.6.git/arch/x86/kernel/Makefile
===================================================================
--- linux-2.6.git.orig/arch/x86/kernel/Makefile
+++ linux-2.6.git/arch/x86/kernel/Makefile
@@ -99,6 +99,8 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) 
 obj-$(CONFIG_SWIOTLB)			+= pci-swiotlb.o
 obj-$(CONFIG_OF)			+= devicetree.o
 
+obj-$(CONFIG_BINFMT_ELF_CKPT)		+= elf_ckpt.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
Index: linux-2.6.git/arch/x86/kernel/elf_ckpt.c
===================================================================
--- /dev/null
+++ linux-2.6.git/arch/x86/kernel/elf_ckpt.c
@@ -0,0 +1,161 @@
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/errno.h>
+#include <linux/signal.h>
+#include <linux/binfmts.h>
+#include <linux/string.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/personality.h>
+#include <linux/elfcore.h>
+#include <linux/init.h>
+#include <linux/highuid.h>
+#include <linux/compiler.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>
+#include <linux/security.h>
+#include <linux/random.h>
+#include <linux/elf.h>
+#include <linux/utsname.h>
+#include <linux/coredump.h>
+#include <linux/regset.h>
+
+#include <asm/uaccess.h>
+#include <asm/param.h>
+#include <asm/page.h>
+#include <asm/prctl.h>
+#include <asm/proto.h>
+#include <asm/i387.h>
+
+#include <linux/elf_ckpt.h>
+#include <linux/flex_array.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+
+#ifdef CONFIG_X86_64
+
+#define cp_reg(d, s, r) d.r = s.r
+
+int load_elf_ckpt_arch(struct task_struct *tsk, struct pt_regs *regs,
+		       struct core_entry *core_entry)
+{
+	struct ckpt_arch_entry *arch = (struct ckpt_arch_entry *)core_entry->arch;
+	struct thread_struct *thread = &current->thread;
+
+	struct user_regs_struct gpregs;
+	struct user_i387_struct	fpregs;
+
+	mm_segment_t old_fs;
+	int i, ret;
+
+	if (core_entry->header.arch != CKPT_HEADER_ARCH_X86_64) {
+		pr_err("elf-ckpt-x86: Unsupported or corrupted header\n");
+		return -ENOEXEC;
+	}
+
+	BUILD_BUG_ON(CKPT_GDT_ENTRY_TLS_ENTRIES != GDT_ENTRY_TLS_ENTRIES);
+	BUILD_BUG_ON(sizeof(struct ckpt_arch_entry) > CKPT_ARCH_SIZE);
+
+	memset(&gpregs, 0, sizeof(gpregs));
+	memset(&fpregs, 0, sizeof(fpregs));
+
+	/*
+	 * General purpose registers
+	 */
+	cp_reg(gpregs, arch->gpregs, r15);
+	cp_reg(gpregs, arch->gpregs, r14);
+	cp_reg(gpregs, arch->gpregs, r13);
+	cp_reg(gpregs, arch->gpregs, r12);
+	cp_reg(gpregs, arch->gpregs, bp);
+	cp_reg(gpregs, arch->gpregs, bx);
+	cp_reg(gpregs, arch->gpregs, r11);
+	cp_reg(gpregs, arch->gpregs, r10);
+	cp_reg(gpregs, arch->gpregs, r9);
+	cp_reg(gpregs, arch->gpregs, r8);
+	cp_reg(gpregs, arch->gpregs, ax);
+	cp_reg(gpregs, arch->gpregs, cx);
+	cp_reg(gpregs, arch->gpregs, dx);
+	cp_reg(gpregs, arch->gpregs, si);
+	cp_reg(gpregs, arch->gpregs, di);
+	cp_reg(gpregs, arch->gpregs, orig_ax);
+	cp_reg(gpregs, arch->gpregs, ip);
+	cp_reg(gpregs, arch->gpregs, cs);
+	cp_reg(gpregs, arch->gpregs, flags);
+	cp_reg(gpregs, arch->gpregs, sp);
+	cp_reg(gpregs, arch->gpregs, ss);
+	cp_reg(gpregs, arch->gpregs, fs_base);
+	cp_reg(gpregs, arch->gpregs, gs_base);
+	cp_reg(gpregs, arch->gpregs, ds);
+	cp_reg(gpregs, arch->gpregs, es);
+	cp_reg(gpregs, arch->gpregs, fs);
+	cp_reg(gpregs, arch->gpregs, gs);
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+	ret = arch_ptrace(current, PTRACE_SETREGS, 0, (unsigned long)&gpregs);
+	set_fs(old_fs);
+	if (ret)
+		goto out;
+
+	*regs = *task_pt_regs(current);
+
+	thread->usersp	= arch->gpregs.sp;
+	thread->ds	= arch->gpregs.ds;
+	thread->es	= arch->gpregs.es;
+	thread->fs	= arch->gpregs.fs;
+	thread->gs	= arch->gpregs.gs;
+
+	thread->fsindex	= thread->fs;
+	thread->gsindex = thread->gs;
+
+	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
+		thread->tls_array[i].a = arch->tls_array[i].a;
+		thread->tls_array[i].b = arch->tls_array[i].b;
+	}
+
+	if (arch->gpregs.fs_base) {
+		ret = do_arch_prctl(current, ARCH_SET_FS, arch->gpregs.fs_base);
+		if (ret)
+			goto out;
+	}
+
+	if (arch->gpregs.gs_base) {
+		ret = do_arch_prctl(current, ARCH_SET_GS, arch->gpregs.gs_base);
+		if (ret)
+			goto out;
+	}
+
+	/* Restoring FPU */
+	if (core_entry->task_flags & PF_USED_MATH) {
+
+		cp_reg(fpregs, arch->fpregs, cwd);
+		cp_reg(fpregs, arch->fpregs, swd);
+		cp_reg(fpregs, arch->fpregs, twd);
+		cp_reg(fpregs, arch->fpregs, fop);
+		cp_reg(fpregs, arch->fpregs, rip);
+		cp_reg(fpregs, arch->fpregs, rdp);
+		cp_reg(fpregs, arch->fpregs, mxcsr);
+		cp_reg(fpregs, arch->fpregs, mxcsr_mask);
+
+		for (i = 0; i < ARRAY_SIZE(arch->fpregs.st_space); i++)
+			cp_reg(fpregs, arch->fpregs, st_space[i]);
+
+		for (i = 0; i < ARRAY_SIZE(arch->fpregs.xmm_space); i++)
+			cp_reg(fpregs, arch->fpregs, xmm_space[i]);
+
+		old_fs = get_fs();
+		set_fs(KERNEL_DS);
+		ret = arch_ptrace(current, PTRACE_SETFPREGS, 0, (unsigned long)&fpregs);
+		set_fs(old_fs);
+		if (ret)
+			goto out;
+	}
+
+out:
+	return ret;
+}
+
+#endif /* CONFIG_X86_64 */
Index: linux-2.6.git/arch/x86/vdso/vma.c
===================================================================
--- linux-2.6.git.orig/arch/x86/vdso/vma.c
+++ linux-2.6.git/arch/x86/vdso/vma.c
@@ -137,6 +137,28 @@ up_fail:
 	return ret;
 }
 
+int arch_setup_additional_pages_at(struct linux_binprm *bprm, void *addr, int uses_interp)
+{
+	struct mm_struct *mm = current->mm;
+	int ret;
+
+	if (!vdso_enabled)
+		return 0;
+
+	down_write(&mm->mmap_sem);
+	current->mm->context.vdso = addr;
+	ret = install_special_mapping(mm, (unsigned long)addr, vdso_size,
+				      VM_READ | VM_EXEC |
+				      VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC |
+				      VM_ALWAYSDUMP,
+				      vdso_pages);
+	if (ret)
+		current->mm->context.vdso = NULL;
+
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
 static __init int vdso_setup(char *s)
 {
 	vdso_enabled = simple_strtoul(s, NULL, 0);
Index: linux-2.6.git/fs/Kconfig.binfmt
===================================================================
--- linux-2.6.git.orig/fs/Kconfig.binfmt
+++ linux-2.6.git/fs/Kconfig.binfmt
@@ -23,6 +23,17 @@ config BINFMT_ELF
 	  ld.so (check the file <file:Documentation/Changes> for location and
 	  latest version).
 
+config BINFMT_ELF_CKPT
+	tristate "Kernel support for CKPT ELF binaries"
+	default n
+	depends on BINFMT_ELF && X86_64
+	help
+	  ELF CKPT (checkpoint) is an extension to ELF format to restore
+	  checkpointed processes. It's not confirmed yet and highly
+	  experimental.
+
+	  If unsure, say N.
+
 config COMPAT_BINFMT_ELF
 	bool
 	depends on COMPAT && BINFMT_ELF
Index: linux-2.6.git/fs/Makefile
===================================================================
--- linux-2.6.git.orig/fs/Makefile
+++ linux-2.6.git/fs/Makefile
@@ -37,6 +37,7 @@ obj-$(CONFIG_BINFMT_MISC)	+= binfmt_misc
 obj-y				+= binfmt_script.o
 
 obj-$(CONFIG_BINFMT_ELF)	+= binfmt_elf.o
+obj-$(CONFIG_BINFMT_ELF_CKPT)	+= binfmt_elf_ckpt.o
 obj-$(CONFIG_COMPAT_BINFMT_ELF)	+= compat_binfmt_elf.o
 obj-$(CONFIG_BINFMT_ELF_FDPIC)	+= binfmt_elf_fdpic.o
 obj-$(CONFIG_BINFMT_SOM)	+= binfmt_som.o
Index: linux-2.6.git/fs/binfmt_elf.c
===================================================================
--- linux-2.6.git.orig/fs/binfmt_elf.c
+++ linux-2.6.git/fs/binfmt_elf.c
@@ -30,6 +30,7 @@
 #include <linux/security.h>
 #include <linux/random.h>
 #include <linux/elf.h>
+#include <linux/elf_ckpt.h>
 #include <linux/utsname.h>
 #include <linux/coredump.h>
 #include <asm/uaccess.h>
@@ -592,7 +593,11 @@ static int load_elf_binary(struct linux_
 	if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
 		goto out;
 
-	if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
+	if (loc->elf_ex.e_type != ET_EXEC &&
+#ifdef CONFIG_BINFMT_ELF_CKPT
+	    loc->elf_ex.e_type != ET_CKPT  &&
+#endif
+	    loc->elf_ex.e_type != ET_DYN)
 		goto out;
 	if (!elf_check_arch(&loc->elf_ex))
 		goto out;
@@ -619,6 +624,16 @@ static int load_elf_binary(struct linux_
 		goto out_free_ph;
 	}
 
+#ifdef CONFIG_BINFMT_ELF_CKPT
+	if (loc->elf_ex.e_type == ET_CKPT) {
+		retval = load_elf_ckpt(bprm, regs, &loc->elf_ex,
+				       (struct elf_phdr *)elf_phdata);
+		if (!retval)
+			set_binfmt(&elf_format);
+		goto out_free_ph;
+	}
+#endif
+
 	elf_ppnt = elf_phdata;
 	elf_bss = 0;
 	elf_brk = 0;
Index: linux-2.6.git/fs/binfmt_elf_ckpt.c
===================================================================
--- /dev/null
+++ linux-2.6.git/fs/binfmt_elf_ckpt.c
@@ -0,0 +1,356 @@
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/errno.h>
+#include <linux/signal.h>
+#include <linux/binfmts.h>
+#include <linux/string.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/personality.h>
+#include <linux/elfcore.h>
+#include <linux/init.h>
+#include <linux/highuid.h>
+#include <linux/compiler.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>
+#include <linux/security.h>
+#include <linux/random.h>
+#include <linux/elf.h>
+#include <linux/utsname.h>
+#include <linux/coredump.h>
+#include <linux/regset.h>
+
+#include <asm/uaccess.h>
+#include <asm/param.h>
+#include <asm/page.h>
+#include <asm/prctl.h>
+#include <asm/proto.h>
+#include <asm/i387.h>
+
+#include <linux/elf_ckpt.h>
+#include <asm/elf_ckpt.h>
+
+#include <linux/flex_array.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+
+int load_elf_ckpt(struct linux_binprm *bprm, struct pt_regs *regs,
+		  struct elfhdr *elf_ex, struct elf_phdr *elf_phdr)
+{
+	struct elf_phdr *elf_phdr_pages;
+	struct flex_array *fa = NULL;
+	struct vma_entry *vma_entry_ptr;
+	int nr_vma_found, nr_vma_mapped;
+	struct vma_entry vma_entry;
+	struct file *file = NULL;
+	unsigned long map_addr;
+
+#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
+	unsigned long vdso = -1UL;
+#endif
+
+	struct core_entry *core_entry = NULL;
+	unsigned long start_stack = -1UL;
+
+	int i, ret = -ENOEXEC;
+	loff_t off;
+
+	BUILD_BUG_ON(CKPT_TASK_COMM_LEN != TASK_COMM_LEN);
+	BUILD_BUG_ON(CKPT_PAGE_SIZE != PAGE_SIZE);
+	BUILD_BUG_ON(CKPT_CORE_SIZE != sizeof(*core_entry));
+
+	elf_phdr_pages	= NULL;
+	nr_vma_found	= 0;
+	nr_vma_mapped	= 0;
+
+	/*
+	 * An early check for header version so if we fail here
+	 * we would not need to use flex array at all.
+	 */
+	for (i = 0; i < elf_ex->e_phnum; i++) {
+		if (elf_phdr[i].p_type != PT_CKPT_CORE)
+			continue;
+
+		core_entry = vmalloc(sizeof(*core_entry));
+		if (!core_entry) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		ret = kernel_read(bprm->file, elf_phdr[i].p_offset,
+				  (char *)core_entry, sizeof(*core_entry));
+		if (ret != sizeof(*core_entry)) {
+			pr_err("elf-ckpt: Can't read core_entry\n");
+			ret = -EIO;
+			goto out;
+		}
+
+		if (core_entry->header.version != CKPT_HEADER_VERSION) {
+			pr_err("elf-ckpt: Unsupported or corrupted header\n");
+			ret = -ENOEXEC;
+			goto out;
+		}
+
+		break;
+	}
+
+	if (i == elf_ex->e_phnum) {
+		pr_err("elf-ckpt: No header found\n");
+		ret = -ENOEXEC;
+		goto out;
+	}
+
+
+	fa = flex_array_alloc(sizeof(vma_entry), elf_ex->e_phnum, GFP_KERNEL);
+	if (!fa || flex_array_prealloc(fa, 0, elf_ex->e_phnum, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		if (fa) {
+			flex_array_free(fa);
+			fa = NULL;
+			goto out;
+		}
+	}
+
+	ret = flush_exec_keep_thread(bprm);
+	if (ret)
+		goto out;
+
+	current->flags &= ~PF_FORKNOEXEC;
+	current->mm->def_flags = 0;
+
+	/*
+	 * We don't care about parameters passed (such as argc, argv, env)
+	 * when execute checkpoint file because we're to substitute
+	 * all things anyway.
+	 */
+	do_munmap(current->mm, 0, TASK_SIZE);
+
+	SET_PERSONALITY(loc->elf_ex);
+
+	for (i = 0; i < elf_ex->e_phnum; i++) {
+
+		switch (elf_phdr[i].p_type) {
+		case PT_CKPT_VMA:
+			ret = kernel_read(bprm->file, elf_phdr[i].p_offset,
+					  (char *)&vma_entry, sizeof(vma_entry));
+			if (ret != sizeof(vma_entry)) {
+				pr_err("elf-ckpt: Can't read vma_entry\n");
+				ret = -EIO;
+				goto out;
+			}
+			if (flex_array_put(fa, i, &vma_entry, GFP_KERNEL))
+				BUG();
+
+			/* We need to know if there is executable stack */
+			if (vma_entry.status & VMA_AREA_STACK) {
+				if (vma_entry.flags & PROT_EXEC)
+					current->personality |= READ_IMPLIES_EXEC;
+			}
+
+			nr_vma_found++;
+			continue;
+		case PT_CKPT_PAGES:
+			elf_phdr_pages = &elf_phdr[i];
+			continue;
+		default:
+			continue;
+		}
+	}
+
+	/* Be sure it has the file structure we expected to see. */
+	if (!elf_phdr_pages || !nr_vma_found) {
+		ret = -ENOEXEC;
+		goto out;
+	}
+
+	/*
+	 * VMA randomization still needs to be set (just in case if
+	 * the program we restore will exec() something else later).
+	 */
+	if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
+		current->flags |= PF_RANDOMIZE;
+
+	/*
+	 * FIXME: Note it flushes signal handlers as well,
+	 * so we need to dump queued signals and restore
+	 * them here.
+	 */
+	setup_new_exec(bprm);
+
+	current->mm->free_area_cache = current->mm->mmap_base;
+	current->mm->cached_hole_size = 0;
+
+	for (i = 0; i < nr_vma_found; i++) {
+		vma_entry_ptr = flex_array_get(fa, i);
+
+#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
+		if (vma_entry_ptr->status & VMA_AREA_VDSO)
+			vdso = vma_entry_ptr->start;
+#endif
+
+		if (vma_entry_ptr->status & VMA_AREA_STACK) {
+			/* Note if stack is VM_GROWSUP -- it should be reversed */
+			start_stack = vma_entry_ptr->start;
+		}
+
+		/* Anything special should be ignored */
+		if (!(vma_entry_ptr->status & VMA_AREA_REGULAR))
+			continue;
+
+		/* It's a file mmap'ed */
+		if (vma_entry_ptr->fd != -1) {
+			file = fget((unsigned int)vma_entry_ptr->fd);
+			if (!file) {
+				ret = -EBADF;
+				goto out_unmap;
+			}
+
+			/* Reuse this field to handle error cases */
+			vma_entry_ptr->fd = (__u64)file;
+		} else
+			file = NULL;
+
+		down_write(&current->mm->mmap_sem);
+		map_addr = do_mmap(file,
+				   vma_entry_ptr->start,
+				   vma_entry_ptr->end - vma_entry_ptr->start,
+				   vma_entry_ptr->prot,
+				   vma_entry_ptr->flags | MAP_FIXED,
+				   vma_entry_ptr->pgoff);
+		up_write(&current->mm->mmap_sem);
+
+		if (file) {
+			fput(file);
+			do_close((unsigned int)vma_entry_ptr->fd);
+		}
+
+		if ((unsigned long)(map_addr) >= TASK_SIZE) {
+			ret = IS_ERR((void *)map_addr) ? PTR_ERR((void*)map_addr) : -EINVAL;
+			goto out_unmap;
+		}
+
+		nr_vma_mapped++;
+	}
+
+#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
+	if (vdso == -1UL) {
+		pr_err("elf-ckpt: Can't find VDSO address\n");
+		ret = -ENOEXEC;
+		goto out_unmap;
+	}
+#endif
+
+	if (start_stack == -1UL) {
+		pr_err("elf-ckpt: Can't find stack VMA\n");
+		ret = -ENOEXEC;
+		goto out_unmap;
+	}
+
+	/* The name it has before */
+	set_task_comm(current, core_entry->task_comm);
+
+	bprm->p = core_entry->mm_start_stack;
+
+	current->mm->start_code		= core_entry->mm_start_code;
+	current->mm->end_code		= core_entry->mm_end_code;
+	current->mm->start_data		= core_entry->mm_start_data;
+	current->mm->end_data		= core_entry->mm_end_data;
+	current->mm->start_stack	= core_entry->mm_start_stack;
+	current->mm->start_brk		= core_entry->mm_start_brk;
+	current->mm->brk		= core_entry->mm_brk;
+
+#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
+	ret = arch_setup_additional_pages_at(bprm, (void *)vdso, 0);
+	if (ret) {
+		pr_err("elf-ckpt: Can't setup additional pages at %lx with %d\n",
+			vdso, ret);
+		goto out_unmap;
+	}
+#endif
+
+	/*
+	 * Restore pages
+	 */
+	off = elf_phdr_pages->p_offset;
+	while (1) {
+		struct vm_area_struct *vma;
+		struct page *page;
+		void *page_data;
+		__u64 va;
+
+		ret = kernel_read(bprm->file, off, (char *)&va, sizeof(va));
+		if (ret != sizeof(va)) {
+			pr_err("elf-ckpt: Can't read page virtual address: "
+			       "ret = %d off = %lx\n", ret, (unsigned long)off);
+			ret = -EIO;
+			goto out_unmap;
+		}
+
+		/* End of pages reached */
+		if (!va)
+			break;
+
+		vma = find_vma(current->mm, (unsigned long)va);
+		if (!vma) {
+			pr_err("elf-ckpt: No VMA for page: %16lx\n", (unsigned long)va);
+			ret = -ESRCH;
+			goto out_unmap;
+		}
+
+		ret = get_user_pages(current, current->mm, (unsigned long)va,
+				     1, 1, 1, &page, NULL);
+		if (ret != 1) {
+			pr_err("elf-ckpt: Can't get user page: %16lx\n", (unsigned long)va);
+			ret = -EFAULT;
+			goto out_unmap;
+		}
+
+		page_data = kmap(page);
+		ret = kernel_read(bprm->file, off + sizeof(va), page_data, PAGE_SIZE);
+		kunmap(page);
+		put_page(page);
+
+		if (ret != PAGE_SIZE) {
+			pr_err("elf-ckpt: Can't read data on page: %16lx\n", (unsigned long)va);
+			ret = -EFAULT;
+			goto out_unmap;
+		}
+
+		off += sizeof(va) + PAGE_SIZE;
+	}
+
+	/*
+	 * Architecture specific setup for registers
+	 * and friends, it's done lately since if
+	 * an error happened before there is no much
+	 * point to setup arch-specific things at all.
+	*/
+	ret = load_elf_ckpt_arch(current, regs, core_entry);
+	if (ret)
+		goto out_unmap;
+
+	/* We're done */
+	ret = 0;
+out:
+	if (core_entry)
+		vfree(core_entry);
+
+	if (fa)
+		flex_array_free(fa);
+	return ret;
+
+out_unmap:
+	for (i = 0; i < nr_vma_mapped; i++) {
+		vma_entry_ptr = flex_array_get(fa, i);
+		down_write(&current->mm->mmap_sem);
+		do_munmap(current->mm, vma_entry_ptr->start,
+			  vma_entry_ptr->end - vma_entry_ptr->start);
+		up_write(&current->mm->mmap_sem);
+	}
+
+	send_sig(SIGKILL, current, 0);
+	goto out;
+}
Index: linux-2.6.git/fs/exec.c
===================================================================
--- linux-2.6.git.orig/fs/exec.c
+++ linux-2.6.git/fs/exec.c
@@ -1071,18 +1071,10 @@ void set_task_comm(struct task_struct *t
 	perf_event_comm(tsk);
 }
 
-int flush_old_exec(struct linux_binprm * bprm)
+int flush_exec_keep_thread(struct linux_binprm * bprm)
 {
 	int retval;
 
-	/*
-	 * Make sure we have a private signal table and that
-	 * we are unassociated from the previous thread group.
-	 */
-	retval = de_thread(current);
-	if (retval)
-		goto out;
-
 	set_mm_exe_file(bprm->mm, bprm->file);
 
 	/*
@@ -1101,10 +1093,25 @@ int flush_old_exec(struct linux_binprm *
 	current->personality &= ~bprm->per_clear;
 
 	return 0;
-
 out:
 	return retval;
 }
+EXPORT_SYMBOL(flush_exec_keep_thread);
+
+int flush_old_exec(struct linux_binprm * bprm)
+{
+	int retval;
+
+	/*
+	 * Make sure we have a private signal table and that
+	 * we are unassociated from the previous thread group.
+	 */
+	retval = de_thread(current);
+	if (retval)
+		return retval;
+
+	return flush_exec_keep_thread(bprm);
+}
 EXPORT_SYMBOL(flush_old_exec);
 
 void would_dump(struct linux_binprm *bprm, struct file *file)
Index: linux-2.6.git/include/linux/binfmts.h
===================================================================
--- linux-2.6.git.orig/include/linux/binfmts.h
+++ linux-2.6.git/include/linux/binfmts.h
@@ -110,6 +110,7 @@ extern int prepare_binprm(struct linux_b
 extern int __must_check remove_arg_zero(struct linux_binprm *);
 extern int search_binary_handler(struct linux_binprm *, struct pt_regs *);
 extern int flush_old_exec(struct linux_binprm * bprm);
+extern int flush_exec_keep_thread(struct linux_binprm * bprm);
 extern void setup_new_exec(struct linux_binprm * bprm);
 extern void would_dump(struct linux_binprm *, struct file *);
 
Index: linux-2.6.git/include/linux/elf_ckpt.h
===================================================================
--- /dev/null
+++ linux-2.6.git/include/linux/elf_ckpt.h
@@ -0,0 +1,103 @@
+#ifndef _LINUX_ELF_CHECKPOINT_H
+#define _LINUX_ELF_CHECKPOINT_H
+
+#ifdef __KERNEL__
+
+#include <linux/types.h>
+#include <linux/elf-em.h>
+
+#include <asm/elf.h>
+#include <asm/elf_ckpt.h>
+
+/*
+ * Elf extension includes new Elf file type
+ * and program header types as well.
+ */
+#define ET_CKPT				5
+
+#define PT_CKPT_OFFSET			0x01010101
+
+#define PT_CKPT_VMA			(PT_LOOS + PT_CKPT_OFFSET + 1)
+#define PT_CKPT_CORE			(PT_LOOS + PT_CKPT_OFFSET + 2)
+#define PT_CKPT_PAGES			(PT_LOOS + PT_CKPT_OFFSET + 3)
+
+#define CKPT_PAGE_SIZE			4096
+#define CKPT_TASK_COMM_LEN		16
+
+#define CKPT_HEADER_VERSION		1
+#define CKPT_HEADER_ARCH_X86_64		1
+
+#define VMA_AREA_REGULAR		(1 <<  0)
+#define VMA_AREA_STACK			(1 <<  1)
+#define VMA_AREA_VSYSCALL		(1 <<  2)
+#define VMA_AREA_VDSO			(1 <<  3)
+#define VMA_FORCE_READ			(1 <<  4)
+#define VMA_AREA_HEAP			(1 <<  5)
+#define VMA_FILE_PRIVATE		(1 <<  6)
+#define VMA_FILE_SHARED			(1 <<  7)
+#define VMA_ANON_SHARED			(1 <<  8)
+#define VMA_ANON_PRIVATE		(1 <<  9)
+#define VMA_FORCE_WRITE			(1 << 10)
+
+struct vma_entry {
+	__u64	start;
+	__u64	end;
+	__u64	pgoff;
+	__u32	prot;
+	__u32	flags;
+	__u32	status;			/* from VMA_x above */
+	__u32	pid;			/* pid VMA belongs to */
+	__s64	fd;
+	__u64	ino;
+	__u32	dev_maj;
+	__u32	dev_min;
+} __packed;
+
+struct page_entry {
+	__u64	va;			/* page virtual address */
+	__u8	data[CKPT_PAGE_SIZE];	/* page contents */
+} __packed;
+
+struct image_header {
+	__u16	version;
+	__u16	arch;
+	__u32	flags;
+} __packed;
+
+#define CKPT_ARCH_SIZE		(2 * 4096)
+#define CKPT_CORE_SIZE		(4 * 4096)
+
+struct core_entry {
+  union {
+    struct {
+	struct image_header		header;
+	__u8				arch[CKPT_ARCH_SIZE];	/* should be enough for all archs */
+	__u32				task_personality;
+	__u8				task_comm[CKPT_TASK_COMM_LEN];
+	__u32				task_flags;
+	__u64				mm_start_code;
+	__u64				mm_end_code;
+	__u64				mm_start_data;
+	__u64				mm_end_data;
+	__u64				mm_start_stack;
+	__u64				mm_start_brk;
+	__u64				mm_brk;
+    };
+    __u8				__core_pad[CKPT_CORE_SIZE];
+  };
+} __packed;
+
+#ifdef CONFIG_BINFMT_ELF_CKPT
+extern int load_elf_ckpt(struct linux_binprm *bprm, struct pt_regs *regs,
+			 struct elfhdr *elf_ex, struct elf_phdr *elf_phdr);
+#else
+static inline int load_elf_ckpt(struct linux_binprm *bprm, struct pt_regs *regs,
+				struct elfhdr *elf_ex, struct elf_phdr *elf_phdr)
+{
+	return -ENOEXEC;
+}
+#endif
+
+#endif /* __KERNEL__ */
+
+#endif /* _LINUX_ELF_CHECKPOINT_H */


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 1/5] proc: Introduce the Children: line in /proc/<pid>/status
  2011-10-14 11:04 ` [patch 1/5] proc: Introduce the Children: line in /proc/<pid>/status Cyrill Gorcunov
@ 2011-10-14 16:36   ` Tejun Heo
  0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2011-10-14 16:36 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Vagin, Pavel Emelyanov, James Bottomley,
	Glauber Costa, H. Peter Anvin, Ingo Molnar, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan, Serge Hallyn

On Fri, Oct 14, 2011 at 03:04:17PM +0400, Cyrill Gorcunov wrote:
> From: Pavel Emelyanov <xemul@parallels.com>
> 
> Although we can get the pids of some task's issue, this is just 
> more convenient to have them this way.
> 
> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
> Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>

I've already written this but this info is already available through
/proc.  Maybe having this is beneficial but I think it would be better
to concentrate on the minimal set for now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 3/5] fs, proc: Add /proc/$pid/tls entry
  2011-10-14 11:04 ` [patch 3/5] fs, proc: Add /proc/$pid/tls entry Cyrill Gorcunov
@ 2011-10-14 16:40   ` Tejun Heo
  2011-10-14 16:43     ` Cyrill Gorcunov
  0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2011-10-14 16:40 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Vagin, Pavel Emelyanov, James Bottomley,
	Glauber Costa, H. Peter Anvin, Ingo Molnar, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan

On Fri, Oct 14, 2011 at 03:04:19PM +0400, Cyrill Gorcunov wrote:
> To be able to restart checkpointed tasks we need
> to know TLS status at dumping time. Export this
> information by /proc/$pid/tls entry.
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>

No objection from me but it would be great if you can go over other
archs and see how other archs deal with tls just in case.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 3/5] fs, proc: Add /proc/$pid/tls entry
  2011-10-14 16:40   ` Tejun Heo
@ 2011-10-14 16:43     ` Cyrill Gorcunov
  0 siblings, 0 replies; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-14 16:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Andrew Vagin, Pavel Emelyanov, James Bottomley,
	Glauber Costa, H. Peter Anvin, Ingo Molnar, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan

On Fri, Oct 14, 2011 at 09:40:50AM -0700, Tejun Heo wrote:
> On Fri, Oct 14, 2011 at 03:04:19PM +0400, Cyrill Gorcunov wrote:
> > To be able to restart checkpointed tasks we need
> > to know TLS status at dumping time. Export this
> > information by /proc/$pid/tls entry.
> > 
> > Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> 
> No objection from me but it would be great if you can go over other
> archs and see how other archs deal with tls just in case.
> 

I'll check, at moment I found it at x86 only (in thread_info) but definitely
need to re-check. Thanks Tejun!

	Cyrill

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-14 11:04 ` [patch 5/5] elf: Add support for loading ET_CKPT files Cyrill Gorcunov
@ 2011-10-14 17:10   ` Tejun Heo
  2011-10-14 17:33     ` Tejun Heo
                       ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Tejun Heo @ 2011-10-14 17:10 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Vagin, Pavel Emelyanov, James Bottomley,
	Glauber Costa, H. Peter Anvin, Ingo Molnar, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

Hello,

(cc'ing Oleg and Linus, and quoting whole body)

On Fri, Oct 14, 2011 at 03:04:21PM +0400, Cyrill Gorcunov wrote:
> This patch add ability to run that named "checkpoint" files by
> enhancing Elf file format, which includes
> 
>  - new Elf file type ET_CKPT
> 
>  - three additional program header types PT_CKPT_VMA, PT_CKPT_CORE
>    and PT_CKPT_PAGES.
> 
>      PT_CKPT_VMA -- holds 'vma_entry' structure, which describes the
>      memory area the kernel should map. It also might contain a file descriptor
>      so the kernel will be mapping a file povided. Usually such file get
>      opened by user-space helper tool which prepares 'vma_entry' structure
>      for the kernel.
> 
>      PT_CKPT_CORE -- 'core_entry' structure (registers, tls, tasks specific
>      settings). The structure is defined as a 16K container which should be
>      enough for most cases. 8K of it is reserved for arch specific settings.
> 
>      PT_CKPT_PAGES -- a set of all pages which contents we should restored.
> 
> Apart from Elf extension flush_old_exec() has been splitted to two
> functions -- the former flush_old_exec() and flush_exec_keep_thread().
> The later doesn't call for de_thread() allowing to keep threads
> relationship. Also arch_setup_additional_pages_at() helper added
> to setup vdso at predefined address.
> 
> At moment only pure x86-64 architecture is supported.

I don't think this is a good idea.  We already have most of interface
necessary for restoring task state and there's no need to put it into
the kernel as one piece.  If there's something which can't be done
from userland using existing interfaces, let's please discuss what
they are and whether they can be resolved differently first.

The exec path is very intricate as it is and it would be very easy to
introduce security or other bugs by altering its basic behavior.  exec
presumes destruction of (most of) the current process and all its
other threads and replacing them w/ fresh states from an executable.
The scary part - interaction with process hierarchy and zapping of the
current state - is handled from the core exec code.

I see that you removed zapping to allow restoring multi-threaded
process, which seems quite scary to me.  It might be okay, I don't
know, but IMHO it just isn't a very good idea to introduce such
variation to basic exec behavior for this rather narrow use case.

In addition, leaving it alone doesn't really solve multi-threaded
restoring, does it?  Who sets the task states for other threads?  The
current code doesn't seem to be doing anything but if you're gonna add
it later, how are you gonna synchronize other threads?  If not, what's
the point of pushing this chunk in when userland task state
restoration is necessary anyway?

So, at the moment, I'm rather strongly against this approach.

Thank you.

> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
> CC: Andrew Vagin <avagin@parallels.com>
> CC: Pavel Emelyanov <xemul@parallels.com>
> CC: James Bottomley <jbottomley@parallels.com>
> CC: Glauber Costa <glommer@parallels.com>
> CC: H. Peter Anvin <hpa@zytor.com>
> CC: Ingo Molnar <mingo@elte.hu>
> CC: Tejun Heo <tj@kernel.org>
> CC: Dave Hansen <dave@linux.vnet.ibm.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: Daniel Lezcano <dlezcano@fr.ibm.com>
> CC: Alexey Dobriyan <adobriyan@gmail.com>
> ---
>  arch/x86/include/asm/elf.h      |    3 
>  arch/x86/include/asm/elf_ckpt.h |   80 ++++++++
>  arch/x86/kernel/Makefile        |    2 
>  arch/x86/kernel/elf_ckpt.c      |  161 ++++++++++++++++++
>  arch/x86/vdso/vma.c             |   22 ++
>  fs/Kconfig.binfmt               |   11 +
>  fs/Makefile                     |    1 
>  fs/binfmt_elf.c                 |   17 +
>  fs/binfmt_elf_ckpt.c            |  356 ++++++++++++++++++++++++++++++++++++++++
>  fs/exec.c                       |   27 +--
>  include/linux/binfmts.h         |    1 
>  include/linux/elf_ckpt.h        |  103 +++++++++++
>  12 files changed, 772 insertions(+), 12 deletions(-)
> 
> Index: linux-2.6.git/arch/x86/include/asm/elf.h
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/include/asm/elf.h
> +++ linux-2.6.git/arch/x86/include/asm/elf.h
> @@ -314,7 +314,8 @@ struct linux_binprm;
>  #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
>  extern int arch_setup_additional_pages(struct linux_binprm *bprm,
>  				       int uses_interp);
> -
> +extern int arch_setup_additional_pages_at(struct linux_binprm *bprm,
> +					  void *addr, int uses_interp);
>  extern int syscall32_setup_pages(struct linux_binprm *, int exstack);
>  #define compat_arch_setup_additional_pages	syscall32_setup_pages
>  
> Index: linux-2.6.git/arch/x86/include/asm/elf_ckpt.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/arch/x86/include/asm/elf_ckpt.h
> @@ -0,0 +1,80 @@
> +#ifndef _LINUX_ELF_X86_CHECKPOINT_H
> +#define _LINUX_ELF_X86_CHECKPOINT_H
> +
> +#include <linux/errno.h>
> +
> +#include <asm/types.h>
> +#include <asm/ptrace.h>
> +
> +#define CKPT_GDT_ENTRY_TLS_ENTRIES	3
> +
> +struct user_regs_entry {
> +	__u64	r15;
> +	__u64	r14;
> +	__u64	r13;
> +	__u64	r12;
> +	__u64	bp;
> +	__u64	bx;
> +	__u64	r11;
> +	__u64	r10;
> +	__u64	r9;
> +	__u64	r8;
> +	__u64	ax;
> +	__u64	cx;
> +	__u64	dx;
> +	__u64	si;
> +	__u64	di;
> +	__u64	orig_ax;
> +	__u64	ip;
> +	__u64	cs;
> +	__u64	flags;
> +	__u64	sp;
> +	__u64	ss;
> +	__u64	fs_base;
> +	__u64	gs_base;
> +	__u64	ds;
> +	__u64	es;
> +	__u64	fs;
> +	__u64	gs;
> +} __packed;
> +
> +struct desc_struct_entry {
> +	__u32	a;
> +	__u32	b;
> +} __packed;
> +
> +struct user_fpregs_entry {
> +	__u16	cwd;
> +	__u16	swd;
> +	__u16	twd;
> +	__u16	fop;
> +	__u64	rip;
> +	__u64	rdp;
> +	__u32	mxcsr;
> +	__u32	mxcsr_mask;
> +	__u32	st_space[32];
> +	__u32	xmm_space[64];
> +	__u32	padding[24];
> +} __packed;
> +
> +struct ckpt_arch_entry {
> +	struct user_regs_entry		gpregs;
> +	struct user_fpregs_entry	fpregs;
> +	struct desc_struct		tls_array[CKPT_GDT_ENTRY_TLS_ENTRIES];
> +};
> +
> +struct core_entry;
> +
> +#ifdef CONFIG_X86_64
> +extern int load_elf_ckpt_arch(struct task_struct *tsk, struct pt_regs *regs,
> +			      struct core_entry *core_entry);
> +#else
> +static inline int
> +load_elf_ckpt_arch(struct task_struct *tsk, struct pt_regs *regs,
> +		   struct core_entry *core_entry)
> +{
> +	return -ENOEXEC;
> +}
> +#endif
> +
> +#endif /* _LINUX_ELF_X86_CHECKPOINT_H */
> Index: linux-2.6.git/arch/x86/kernel/Makefile
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/kernel/Makefile
> +++ linux-2.6.git/arch/x86/kernel/Makefile
> @@ -99,6 +99,8 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) 
>  obj-$(CONFIG_SWIOTLB)			+= pci-swiotlb.o
>  obj-$(CONFIG_OF)			+= devicetree.o
>  
> +obj-$(CONFIG_BINFMT_ELF_CKPT)		+= elf_ckpt.o
> +
>  ###
>  # 64 bit specific files
>  ifeq ($(CONFIG_X86_64),y)
> Index: linux-2.6.git/arch/x86/kernel/elf_ckpt.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/arch/x86/kernel/elf_ckpt.c
> @@ -0,0 +1,161 @@
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include <linux/errno.h>
> +#include <linux/signal.h>
> +#include <linux/binfmts.h>
> +#include <linux/string.h>
> +#include <linux/file.h>
> +#include <linux/slab.h>
> +#include <linux/personality.h>
> +#include <linux/elfcore.h>
> +#include <linux/init.h>
> +#include <linux/highuid.h>
> +#include <linux/compiler.h>
> +#include <linux/highmem.h>
> +#include <linux/pagemap.h>
> +#include <linux/security.h>
> +#include <linux/random.h>
> +#include <linux/elf.h>
> +#include <linux/utsname.h>
> +#include <linux/coredump.h>
> +#include <linux/regset.h>
> +
> +#include <asm/uaccess.h>
> +#include <asm/param.h>
> +#include <asm/page.h>
> +#include <asm/prctl.h>
> +#include <asm/proto.h>
> +#include <asm/i387.h>
> +
> +#include <linux/elf_ckpt.h>
> +#include <linux/flex_array.h>
> +#include <asm/tlbflush.h>
> +#include <asm/desc.h>
> +
> +#ifdef CONFIG_X86_64
> +
> +#define cp_reg(d, s, r) d.r = s.r
> +
> +int load_elf_ckpt_arch(struct task_struct *tsk, struct pt_regs *regs,
> +		       struct core_entry *core_entry)
> +{
> +	struct ckpt_arch_entry *arch = (struct ckpt_arch_entry *)core_entry->arch;
> +	struct thread_struct *thread = &current->thread;
> +
> +	struct user_regs_struct gpregs;
> +	struct user_i387_struct	fpregs;
> +
> +	mm_segment_t old_fs;
> +	int i, ret;
> +
> +	if (core_entry->header.arch != CKPT_HEADER_ARCH_X86_64) {
> +		pr_err("elf-ckpt-x86: Unsupported or corrupted header\n");
> +		return -ENOEXEC;
> +	}
> +
> +	BUILD_BUG_ON(CKPT_GDT_ENTRY_TLS_ENTRIES != GDT_ENTRY_TLS_ENTRIES);
> +	BUILD_BUG_ON(sizeof(struct ckpt_arch_entry) > CKPT_ARCH_SIZE);
> +
> +	memset(&gpregs, 0, sizeof(gpregs));
> +	memset(&fpregs, 0, sizeof(fpregs));
> +
> +	/*
> +	 * General purpose registers
> +	 */
> +	cp_reg(gpregs, arch->gpregs, r15);
> +	cp_reg(gpregs, arch->gpregs, r14);
> +	cp_reg(gpregs, arch->gpregs, r13);
> +	cp_reg(gpregs, arch->gpregs, r12);
> +	cp_reg(gpregs, arch->gpregs, bp);
> +	cp_reg(gpregs, arch->gpregs, bx);
> +	cp_reg(gpregs, arch->gpregs, r11);
> +	cp_reg(gpregs, arch->gpregs, r10);
> +	cp_reg(gpregs, arch->gpregs, r9);
> +	cp_reg(gpregs, arch->gpregs, r8);
> +	cp_reg(gpregs, arch->gpregs, ax);
> +	cp_reg(gpregs, arch->gpregs, cx);
> +	cp_reg(gpregs, arch->gpregs, dx);
> +	cp_reg(gpregs, arch->gpregs, si);
> +	cp_reg(gpregs, arch->gpregs, di);
> +	cp_reg(gpregs, arch->gpregs, orig_ax);
> +	cp_reg(gpregs, arch->gpregs, ip);
> +	cp_reg(gpregs, arch->gpregs, cs);
> +	cp_reg(gpregs, arch->gpregs, flags);
> +	cp_reg(gpregs, arch->gpregs, sp);
> +	cp_reg(gpregs, arch->gpregs, ss);
> +	cp_reg(gpregs, arch->gpregs, fs_base);
> +	cp_reg(gpregs, arch->gpregs, gs_base);
> +	cp_reg(gpregs, arch->gpregs, ds);
> +	cp_reg(gpregs, arch->gpregs, es);
> +	cp_reg(gpregs, arch->gpregs, fs);
> +	cp_reg(gpregs, arch->gpregs, gs);
> +
> +	old_fs = get_fs();
> +	set_fs(KERNEL_DS);
> +	ret = arch_ptrace(current, PTRACE_SETREGS, 0, (unsigned long)&gpregs);
> +	set_fs(old_fs);
> +	if (ret)
> +		goto out;
> +
> +	*regs = *task_pt_regs(current);
> +
> +	thread->usersp	= arch->gpregs.sp;
> +	thread->ds	= arch->gpregs.ds;
> +	thread->es	= arch->gpregs.es;
> +	thread->fs	= arch->gpregs.fs;
> +	thread->gs	= arch->gpregs.gs;
> +
> +	thread->fsindex	= thread->fs;
> +	thread->gsindex = thread->gs;
> +
> +	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) {
> +		thread->tls_array[i].a = arch->tls_array[i].a;
> +		thread->tls_array[i].b = arch->tls_array[i].b;
> +	}
> +
> +	if (arch->gpregs.fs_base) {
> +		ret = do_arch_prctl(current, ARCH_SET_FS, arch->gpregs.fs_base);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	if (arch->gpregs.gs_base) {
> +		ret = do_arch_prctl(current, ARCH_SET_GS, arch->gpregs.gs_base);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	/* Restoring FPU */
> +	if (core_entry->task_flags & PF_USED_MATH) {
> +
> +		cp_reg(fpregs, arch->fpregs, cwd);
> +		cp_reg(fpregs, arch->fpregs, swd);
> +		cp_reg(fpregs, arch->fpregs, twd);
> +		cp_reg(fpregs, arch->fpregs, fop);
> +		cp_reg(fpregs, arch->fpregs, rip);
> +		cp_reg(fpregs, arch->fpregs, rdp);
> +		cp_reg(fpregs, arch->fpregs, mxcsr);
> +		cp_reg(fpregs, arch->fpregs, mxcsr_mask);
> +
> +		for (i = 0; i < ARRAY_SIZE(arch->fpregs.st_space); i++)
> +			cp_reg(fpregs, arch->fpregs, st_space[i]);
> +
> +		for (i = 0; i < ARRAY_SIZE(arch->fpregs.xmm_space); i++)
> +			cp_reg(fpregs, arch->fpregs, xmm_space[i]);
> +
> +		old_fs = get_fs();
> +		set_fs(KERNEL_DS);
> +		ret = arch_ptrace(current, PTRACE_SETFPREGS, 0, (unsigned long)&fpregs);
> +		set_fs(old_fs);
> +		if (ret)
> +			goto out;
> +	}
> +
> +out:
> +	return ret;
> +}
> +
> +#endif /* CONFIG_X86_64 */
> Index: linux-2.6.git/arch/x86/vdso/vma.c
> ===================================================================
> --- linux-2.6.git.orig/arch/x86/vdso/vma.c
> +++ linux-2.6.git/arch/x86/vdso/vma.c
> @@ -137,6 +137,28 @@ up_fail:
>  	return ret;
>  }
>  
> +int arch_setup_additional_pages_at(struct linux_binprm *bprm, void *addr, int uses_interp)
> +{
> +	struct mm_struct *mm = current->mm;
> +	int ret;
> +
> +	if (!vdso_enabled)
> +		return 0;
> +
> +	down_write(&mm->mmap_sem);
> +	current->mm->context.vdso = addr;
> +	ret = install_special_mapping(mm, (unsigned long)addr, vdso_size,
> +				      VM_READ | VM_EXEC |
> +				      VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC |
> +				      VM_ALWAYSDUMP,
> +				      vdso_pages);
> +	if (ret)
> +		current->mm->context.vdso = NULL;
> +
> +	up_write(&mm->mmap_sem);
> +	return ret;
> +}
> +
>  static __init int vdso_setup(char *s)
>  {
>  	vdso_enabled = simple_strtoul(s, NULL, 0);
> Index: linux-2.6.git/fs/Kconfig.binfmt
> ===================================================================
> --- linux-2.6.git.orig/fs/Kconfig.binfmt
> +++ linux-2.6.git/fs/Kconfig.binfmt
> @@ -23,6 +23,17 @@ config BINFMT_ELF
>  	  ld.so (check the file <file:Documentation/Changes> for location and
>  	  latest version).
>  
> +config BINFMT_ELF_CKPT
> +	tristate "Kernel support for CKPT ELF binaries"
> +	default n
> +	depends on BINFMT_ELF && X86_64
> +	help
> +	  ELF CKPT (checkpoint) is an extension to ELF format to restore
> +	  checkpointed processes. It's not confirmed yet and highly
> +	  experimental.
> +
> +	  If unsure, say N.
> +
>  config COMPAT_BINFMT_ELF
>  	bool
>  	depends on COMPAT && BINFMT_ELF
> Index: linux-2.6.git/fs/Makefile
> ===================================================================
> --- linux-2.6.git.orig/fs/Makefile
> +++ linux-2.6.git/fs/Makefile
> @@ -37,6 +37,7 @@ obj-$(CONFIG_BINFMT_MISC)	+= binfmt_misc
>  obj-y				+= binfmt_script.o
>  
>  obj-$(CONFIG_BINFMT_ELF)	+= binfmt_elf.o
> +obj-$(CONFIG_BINFMT_ELF_CKPT)	+= binfmt_elf_ckpt.o
>  obj-$(CONFIG_COMPAT_BINFMT_ELF)	+= compat_binfmt_elf.o
>  obj-$(CONFIG_BINFMT_ELF_FDPIC)	+= binfmt_elf_fdpic.o
>  obj-$(CONFIG_BINFMT_SOM)	+= binfmt_som.o
> Index: linux-2.6.git/fs/binfmt_elf.c
> ===================================================================
> --- linux-2.6.git.orig/fs/binfmt_elf.c
> +++ linux-2.6.git/fs/binfmt_elf.c
> @@ -30,6 +30,7 @@
>  #include <linux/security.h>
>  #include <linux/random.h>
>  #include <linux/elf.h>
> +#include <linux/elf_ckpt.h>
>  #include <linux/utsname.h>
>  #include <linux/coredump.h>
>  #include <asm/uaccess.h>
> @@ -592,7 +593,11 @@ static int load_elf_binary(struct linux_
>  	if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
>  		goto out;
>  
> -	if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
> +	if (loc->elf_ex.e_type != ET_EXEC &&
> +#ifdef CONFIG_BINFMT_ELF_CKPT
> +	    loc->elf_ex.e_type != ET_CKPT  &&
> +#endif
> +	    loc->elf_ex.e_type != ET_DYN)
>  		goto out;
>  	if (!elf_check_arch(&loc->elf_ex))
>  		goto out;
> @@ -619,6 +624,16 @@ static int load_elf_binary(struct linux_
>  		goto out_free_ph;
>  	}
>  
> +#ifdef CONFIG_BINFMT_ELF_CKPT
> +	if (loc->elf_ex.e_type == ET_CKPT) {
> +		retval = load_elf_ckpt(bprm, regs, &loc->elf_ex,
> +				       (struct elf_phdr *)elf_phdata);
> +		if (!retval)
> +			set_binfmt(&elf_format);
> +		goto out_free_ph;
> +	}
> +#endif
> +
>  	elf_ppnt = elf_phdata;
>  	elf_bss = 0;
>  	elf_brk = 0;
> Index: linux-2.6.git/fs/binfmt_elf_ckpt.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/fs/binfmt_elf_ckpt.c
> @@ -0,0 +1,356 @@
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/mman.h>
> +#include <linux/errno.h>
> +#include <linux/signal.h>
> +#include <linux/binfmts.h>
> +#include <linux/string.h>
> +#include <linux/file.h>
> +#include <linux/slab.h>
> +#include <linux/personality.h>
> +#include <linux/elfcore.h>
> +#include <linux/init.h>
> +#include <linux/highuid.h>
> +#include <linux/compiler.h>
> +#include <linux/highmem.h>
> +#include <linux/pagemap.h>
> +#include <linux/security.h>
> +#include <linux/random.h>
> +#include <linux/elf.h>
> +#include <linux/utsname.h>
> +#include <linux/coredump.h>
> +#include <linux/regset.h>
> +
> +#include <asm/uaccess.h>
> +#include <asm/param.h>
> +#include <asm/page.h>
> +#include <asm/prctl.h>
> +#include <asm/proto.h>
> +#include <asm/i387.h>
> +
> +#include <linux/elf_ckpt.h>
> +#include <asm/elf_ckpt.h>
> +
> +#include <linux/flex_array.h>
> +#include <asm/tlbflush.h>
> +#include <asm/desc.h>
> +
> +int load_elf_ckpt(struct linux_binprm *bprm, struct pt_regs *regs,
> +		  struct elfhdr *elf_ex, struct elf_phdr *elf_phdr)
> +{
> +	struct elf_phdr *elf_phdr_pages;
> +	struct flex_array *fa = NULL;
> +	struct vma_entry *vma_entry_ptr;
> +	int nr_vma_found, nr_vma_mapped;
> +	struct vma_entry vma_entry;
> +	struct file *file = NULL;
> +	unsigned long map_addr;
> +
> +#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
> +	unsigned long vdso = -1UL;
> +#endif
> +
> +	struct core_entry *core_entry = NULL;
> +	unsigned long start_stack = -1UL;
> +
> +	int i, ret = -ENOEXEC;
> +	loff_t off;
> +
> +	BUILD_BUG_ON(CKPT_TASK_COMM_LEN != TASK_COMM_LEN);
> +	BUILD_BUG_ON(CKPT_PAGE_SIZE != PAGE_SIZE);
> +	BUILD_BUG_ON(CKPT_CORE_SIZE != sizeof(*core_entry));
> +
> +	elf_phdr_pages	= NULL;
> +	nr_vma_found	= 0;
> +	nr_vma_mapped	= 0;
> +
> +	/*
> +	 * An early check for header version so if we fail here
> +	 * we would not need to use flex array at all.
> +	 */
> +	for (i = 0; i < elf_ex->e_phnum; i++) {
> +		if (elf_phdr[i].p_type != PT_CKPT_CORE)
> +			continue;
> +
> +		core_entry = vmalloc(sizeof(*core_entry));
> +		if (!core_entry) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		ret = kernel_read(bprm->file, elf_phdr[i].p_offset,
> +				  (char *)core_entry, sizeof(*core_entry));
> +		if (ret != sizeof(*core_entry)) {
> +			pr_err("elf-ckpt: Can't read core_entry\n");
> +			ret = -EIO;
> +			goto out;
> +		}
> +
> +		if (core_entry->header.version != CKPT_HEADER_VERSION) {
> +			pr_err("elf-ckpt: Unsupported or corrupted header\n");
> +			ret = -ENOEXEC;
> +			goto out;
> +		}
> +
> +		break;
> +	}
> +
> +	if (i == elf_ex->e_phnum) {
> +		pr_err("elf-ckpt: No header found\n");
> +		ret = -ENOEXEC;
> +		goto out;
> +	}
> +
> +
> +	fa = flex_array_alloc(sizeof(vma_entry), elf_ex->e_phnum, GFP_KERNEL);
> +	if (!fa || flex_array_prealloc(fa, 0, elf_ex->e_phnum, GFP_KERNEL)) {
> +		ret = -ENOMEM;
> +		if (fa) {
> +			flex_array_free(fa);
> +			fa = NULL;
> +			goto out;
> +		}
> +	}
> +
> +	ret = flush_exec_keep_thread(bprm);
> +	if (ret)
> +		goto out;
> +
> +	current->flags &= ~PF_FORKNOEXEC;
> +	current->mm->def_flags = 0;
> +
> +	/*
> +	 * We don't care about parameters passed (such as argc, argv, env)
> +	 * when execute checkpoint file because we're to substitute
> +	 * all things anyway.
> +	 */
> +	do_munmap(current->mm, 0, TASK_SIZE);
> +
> +	SET_PERSONALITY(loc->elf_ex);
> +
> +	for (i = 0; i < elf_ex->e_phnum; i++) {
> +
> +		switch (elf_phdr[i].p_type) {
> +		case PT_CKPT_VMA:
> +			ret = kernel_read(bprm->file, elf_phdr[i].p_offset,
> +					  (char *)&vma_entry, sizeof(vma_entry));
> +			if (ret != sizeof(vma_entry)) {
> +				pr_err("elf-ckpt: Can't read vma_entry\n");
> +				ret = -EIO;
> +				goto out;
> +			}
> +			if (flex_array_put(fa, i, &vma_entry, GFP_KERNEL))
> +				BUG();
> +
> +			/* We need to know if there is executable stack */
> +			if (vma_entry.status & VMA_AREA_STACK) {
> +				if (vma_entry.flags & PROT_EXEC)
> +					current->personality |= READ_IMPLIES_EXEC;
> +			}
> +
> +			nr_vma_found++;
> +			continue;
> +		case PT_CKPT_PAGES:
> +			elf_phdr_pages = &elf_phdr[i];
> +			continue;
> +		default:
> +			continue;
> +		}
> +	}
> +
> +	/* Be sure it has the file structure we expected to see. */
> +	if (!elf_phdr_pages || !nr_vma_found) {
> +		ret = -ENOEXEC;
> +		goto out;
> +	}
> +
> +	/*
> +	 * VMA randomization still needs to be set (just in case if
> +	 * the program we restore will exec() something else later).
> +	 */
> +	if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
> +		current->flags |= PF_RANDOMIZE;
> +
> +	/*
> +	 * FIXME: Note it flushes signal handlers as well,
> +	 * so we need to dump queued signals and restore
> +	 * them here.
> +	 */
> +	setup_new_exec(bprm);
> +
> +	current->mm->free_area_cache = current->mm->mmap_base;
> +	current->mm->cached_hole_size = 0;
> +
> +	for (i = 0; i < nr_vma_found; i++) {
> +		vma_entry_ptr = flex_array_get(fa, i);
> +
> +#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
> +		if (vma_entry_ptr->status & VMA_AREA_VDSO)
> +			vdso = vma_entry_ptr->start;
> +#endif
> +
> +		if (vma_entry_ptr->status & VMA_AREA_STACK) {
> +			/* Note if stack is VM_GROWSUP -- it should be reversed */
> +			start_stack = vma_entry_ptr->start;
> +		}
> +
> +		/* Anything special should be ignored */
> +		if (!(vma_entry_ptr->status & VMA_AREA_REGULAR))
> +			continue;
> +
> +		/* It's a file mmap'ed */
> +		if (vma_entry_ptr->fd != -1) {
> +			file = fget((unsigned int)vma_entry_ptr->fd);
> +			if (!file) {
> +				ret = -EBADF;
> +				goto out_unmap;
> +			}
> +
> +			/* Reuse this field to handle error cases */
> +			vma_entry_ptr->fd = (__u64)file;
> +		} else
> +			file = NULL;
> +
> +		down_write(&current->mm->mmap_sem);
> +		map_addr = do_mmap(file,
> +				   vma_entry_ptr->start,
> +				   vma_entry_ptr->end - vma_entry_ptr->start,
> +				   vma_entry_ptr->prot,
> +				   vma_entry_ptr->flags | MAP_FIXED,
> +				   vma_entry_ptr->pgoff);
> +		up_write(&current->mm->mmap_sem);
> +
> +		if (file) {
> +			fput(file);
> +			do_close((unsigned int)vma_entry_ptr->fd);
> +		}
> +
> +		if ((unsigned long)(map_addr) >= TASK_SIZE) {
> +			ret = IS_ERR((void *)map_addr) ? PTR_ERR((void*)map_addr) : -EINVAL;
> +			goto out_unmap;
> +		}
> +
> +		nr_vma_mapped++;
> +	}
> +
> +#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
> +	if (vdso == -1UL) {
> +		pr_err("elf-ckpt: Can't find VDSO address\n");
> +		ret = -ENOEXEC;
> +		goto out_unmap;
> +	}
> +#endif
> +
> +	if (start_stack == -1UL) {
> +		pr_err("elf-ckpt: Can't find stack VMA\n");
> +		ret = -ENOEXEC;
> +		goto out_unmap;
> +	}
> +
> +	/* The name it has before */
> +	set_task_comm(current, core_entry->task_comm);
> +
> +	bprm->p = core_entry->mm_start_stack;
> +
> +	current->mm->start_code		= core_entry->mm_start_code;
> +	current->mm->end_code		= core_entry->mm_end_code;
> +	current->mm->start_data		= core_entry->mm_start_data;
> +	current->mm->end_data		= core_entry->mm_end_data;
> +	current->mm->start_stack	= core_entry->mm_start_stack;
> +	current->mm->start_brk		= core_entry->mm_start_brk;
> +	current->mm->brk		= core_entry->mm_brk;
> +
> +#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
> +	ret = arch_setup_additional_pages_at(bprm, (void *)vdso, 0);
> +	if (ret) {
> +		pr_err("elf-ckpt: Can't setup additional pages at %lx with %d\n",
> +			vdso, ret);
> +		goto out_unmap;
> +	}
> +#endif
> +
> +	/*
> +	 * Restore pages
> +	 */
> +	off = elf_phdr_pages->p_offset;
> +	while (1) {
> +		struct vm_area_struct *vma;
> +		struct page *page;
> +		void *page_data;
> +		__u64 va;
> +
> +		ret = kernel_read(bprm->file, off, (char *)&va, sizeof(va));
> +		if (ret != sizeof(va)) {
> +			pr_err("elf-ckpt: Can't read page virtual address: "
> +			       "ret = %d off = %lx\n", ret, (unsigned long)off);
> +			ret = -EIO;
> +			goto out_unmap;
> +		}
> +
> +		/* End of pages reached */
> +		if (!va)
> +			break;
> +
> +		vma = find_vma(current->mm, (unsigned long)va);
> +		if (!vma) {
> +			pr_err("elf-ckpt: No VMA for page: %16lx\n", (unsigned long)va);
> +			ret = -ESRCH;
> +			goto out_unmap;
> +		}
> +
> +		ret = get_user_pages(current, current->mm, (unsigned long)va,
> +				     1, 1, 1, &page, NULL);
> +		if (ret != 1) {
> +			pr_err("elf-ckpt: Can't get user page: %16lx\n", (unsigned long)va);
> +			ret = -EFAULT;
> +			goto out_unmap;
> +		}
> +
> +		page_data = kmap(page);
> +		ret = kernel_read(bprm->file, off + sizeof(va), page_data, PAGE_SIZE);
> +		kunmap(page);
> +		put_page(page);
> +
> +		if (ret != PAGE_SIZE) {
> +			pr_err("elf-ckpt: Can't read data on page: %16lx\n", (unsigned long)va);
> +			ret = -EFAULT;
> +			goto out_unmap;
> +		}
> +
> +		off += sizeof(va) + PAGE_SIZE;
> +	}
> +
> +	/*
> +	 * Architecture specific setup for registers
> +	 * and friends, it's done lately since if
> +	 * an error happened before there is no much
> +	 * point to setup arch-specific things at all.
> +	*/
> +	ret = load_elf_ckpt_arch(current, regs, core_entry);
> +	if (ret)
> +		goto out_unmap;
> +
> +	/* We're done */
> +	ret = 0;
> +out:
> +	if (core_entry)
> +		vfree(core_entry);
> +
> +	if (fa)
> +		flex_array_free(fa);
> +	return ret;
> +
> +out_unmap:
> +	for (i = 0; i < nr_vma_mapped; i++) {
> +		vma_entry_ptr = flex_array_get(fa, i);
> +		down_write(&current->mm->mmap_sem);
> +		do_munmap(current->mm, vma_entry_ptr->start,
> +			  vma_entry_ptr->end - vma_entry_ptr->start);
> +		up_write(&current->mm->mmap_sem);
> +	}
> +
> +	send_sig(SIGKILL, current, 0);
> +	goto out;
> +}
> Index: linux-2.6.git/fs/exec.c
> ===================================================================
> --- linux-2.6.git.orig/fs/exec.c
> +++ linux-2.6.git/fs/exec.c
> @@ -1071,18 +1071,10 @@ void set_task_comm(struct task_struct *t
>  	perf_event_comm(tsk);
>  }
>  
> -int flush_old_exec(struct linux_binprm * bprm)
> +int flush_exec_keep_thread(struct linux_binprm * bprm)
>  {
>  	int retval;
>  
> -	/*
> -	 * Make sure we have a private signal table and that
> -	 * we are unassociated from the previous thread group.
> -	 */
> -	retval = de_thread(current);
> -	if (retval)
> -		goto out;
> -
>  	set_mm_exe_file(bprm->mm, bprm->file);
>  
>  	/*
> @@ -1101,10 +1093,25 @@ int flush_old_exec(struct linux_binprm *
>  	current->personality &= ~bprm->per_clear;
>  
>  	return 0;
> -
>  out:
>  	return retval;
>  }
> +EXPORT_SYMBOL(flush_exec_keep_thread);
> +
> +int flush_old_exec(struct linux_binprm * bprm)
> +{
> +	int retval;
> +
> +	/*
> +	 * Make sure we have a private signal table and that
> +	 * we are unassociated from the previous thread group.
> +	 */
> +	retval = de_thread(current);
> +	if (retval)
> +		return retval;
> +
> +	return flush_exec_keep_thread(bprm);
> +}
>  EXPORT_SYMBOL(flush_old_exec);
>  
>  void would_dump(struct linux_binprm *bprm, struct file *file)
> Index: linux-2.6.git/include/linux/binfmts.h
> ===================================================================
> --- linux-2.6.git.orig/include/linux/binfmts.h
> +++ linux-2.6.git/include/linux/binfmts.h
> @@ -110,6 +110,7 @@ extern int prepare_binprm(struct linux_b
>  extern int __must_check remove_arg_zero(struct linux_binprm *);
>  extern int search_binary_handler(struct linux_binprm *, struct pt_regs *);
>  extern int flush_old_exec(struct linux_binprm * bprm);
> +extern int flush_exec_keep_thread(struct linux_binprm * bprm);
>  extern void setup_new_exec(struct linux_binprm * bprm);
>  extern void would_dump(struct linux_binprm *, struct file *);
>  
> Index: linux-2.6.git/include/linux/elf_ckpt.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.git/include/linux/elf_ckpt.h
> @@ -0,0 +1,103 @@
> +#ifndef _LINUX_ELF_CHECKPOINT_H
> +#define _LINUX_ELF_CHECKPOINT_H
> +
> +#ifdef __KERNEL__
> +
> +#include <linux/types.h>
> +#include <linux/elf-em.h>
> +
> +#include <asm/elf.h>
> +#include <asm/elf_ckpt.h>
> +
> +/*
> + * Elf extension includes new Elf file type
> + * and program header types as well.
> + */
> +#define ET_CKPT				5
> +
> +#define PT_CKPT_OFFSET			0x01010101
> +
> +#define PT_CKPT_VMA			(PT_LOOS + PT_CKPT_OFFSET + 1)
> +#define PT_CKPT_CORE			(PT_LOOS + PT_CKPT_OFFSET + 2)
> +#define PT_CKPT_PAGES			(PT_LOOS + PT_CKPT_OFFSET + 3)
> +
> +#define CKPT_PAGE_SIZE			4096
> +#define CKPT_TASK_COMM_LEN		16
> +
> +#define CKPT_HEADER_VERSION		1
> +#define CKPT_HEADER_ARCH_X86_64		1
> +
> +#define VMA_AREA_REGULAR		(1 <<  0)
> +#define VMA_AREA_STACK			(1 <<  1)
> +#define VMA_AREA_VSYSCALL		(1 <<  2)
> +#define VMA_AREA_VDSO			(1 <<  3)
> +#define VMA_FORCE_READ			(1 <<  4)
> +#define VMA_AREA_HEAP			(1 <<  5)
> +#define VMA_FILE_PRIVATE		(1 <<  6)
> +#define VMA_FILE_SHARED			(1 <<  7)
> +#define VMA_ANON_SHARED			(1 <<  8)
> +#define VMA_ANON_PRIVATE		(1 <<  9)
> +#define VMA_FORCE_WRITE			(1 << 10)
> +
> +struct vma_entry {
> +	__u64	start;
> +	__u64	end;
> +	__u64	pgoff;
> +	__u32	prot;
> +	__u32	flags;
> +	__u32	status;			/* from VMA_x above */
> +	__u32	pid;			/* pid VMA belongs to */
> +	__s64	fd;
> +	__u64	ino;
> +	__u32	dev_maj;
> +	__u32	dev_min;
> +} __packed;
> +
> +struct page_entry {
> +	__u64	va;			/* page virtual address */
> +	__u8	data[CKPT_PAGE_SIZE];	/* page contents */
> +} __packed;
> +
> +struct image_header {
> +	__u16	version;
> +	__u16	arch;
> +	__u32	flags;
> +} __packed;
> +
> +#define CKPT_ARCH_SIZE		(2 * 4096)
> +#define CKPT_CORE_SIZE		(4 * 4096)
> +
> +struct core_entry {
> +  union {
> +    struct {
> +	struct image_header		header;
> +	__u8				arch[CKPT_ARCH_SIZE];	/* should be enough for all archs */
> +	__u32				task_personality;
> +	__u8				task_comm[CKPT_TASK_COMM_LEN];
> +	__u32				task_flags;
> +	__u64				mm_start_code;
> +	__u64				mm_end_code;
> +	__u64				mm_start_data;
> +	__u64				mm_end_data;
> +	__u64				mm_start_stack;
> +	__u64				mm_start_brk;
> +	__u64				mm_brk;
> +    };
> +    __u8				__core_pad[CKPT_CORE_SIZE];
> +  };
> +} __packed;
> +
> +#ifdef CONFIG_BINFMT_ELF_CKPT
> +extern int load_elf_ckpt(struct linux_binprm *bprm, struct pt_regs *regs,
> +			 struct elfhdr *elf_ex, struct elf_phdr *elf_phdr);
> +#else
> +static inline int load_elf_ckpt(struct linux_binprm *bprm, struct pt_regs *regs,
> +				struct elfhdr *elf_ex, struct elf_phdr *elf_phdr)
> +{
> +	return -ENOEXEC;
> +}
> +#endif
> +
> +#endif /* __KERNEL__ */
> +
> +#endif /* _LINUX_ELF_CHECKPOINT_H */
> 

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-14 17:10   ` Tejun Heo
@ 2011-10-14 17:33     ` Tejun Heo
  2011-10-19  9:03       ` Pavel Emelyanov
  2011-10-15 18:59     ` Cyrill Gorcunov
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2011-10-14 17:33 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: linux-kernel, Andrew Vagin, Pavel Emelyanov, James Bottomley,
	Glauber Costa, H. Peter Anvin, Ingo Molnar, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

Hello, again.

On Fri, Oct 14, 2011 at 10:10:33AM -0700, Tejun Heo wrote:
> I see that you removed zapping to allow restoring multi-threaded
> process, which seems quite scary to me.  It might be okay, I don't
> know, but IMHO it just isn't a very good idea to introduce such
> variation to basic exec behavior for this rather narrow use case.

* It's still calling exec_mmap(), so the exec'ing thread will be on
  the new mm while other threads would still be on the old one.

* There are operations which assume that the calling task is
  de-threaded and thus has exclusive access to data structures which
  can be shared among the thread group (e.g. signal struct).  How are
  accesses to them synchronized?

* What about properties which should be reset and then shared by
  threads by inheriting (credentials, comm, self_exec_id and so on)?
  e.g. new thread follows exec-time imposed rules but old ones keep
  their original credentials.

So, no, it really doesn't look good.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-14 17:10   ` Tejun Heo
  2011-10-14 17:33     ` Tejun Heo
@ 2011-10-15 18:59     ` Cyrill Gorcunov
  2011-10-21 11:06     ` Glauber Costa
  2011-10-22 16:49     ` Dan Merillat
  3 siblings, 0 replies; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-15 18:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Andrew Vagin, Pavel Emelyanov, James Bottomley,
	Glauber Costa, H. Peter Anvin, Ingo Molnar, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On Fri, Oct 14, 2011 at 10:10:33AM -0700, Tejun Heo wrote:
> Hello,
> 

Hi Tejun, sorry for a bit delayed reply,

> (cc'ing Oleg and Linus, and quoting whole body)
> 
> On Fri, Oct 14, 2011 at 03:04:21PM +0400, Cyrill Gorcunov wrote:
> > This patch add ability to run that named "checkpoint" files by
> > enhancing Elf file format, which includes
> > 
> >  - new Elf file type ET_CKPT
> > 
> >  - three additional program header types PT_CKPT_VMA, PT_CKPT_CORE
> >    and PT_CKPT_PAGES.
> > 
> >      PT_CKPT_VMA -- holds 'vma_entry' structure, which describes the
> >      memory area the kernel should map. It also might contain a file descriptor
> >      so the kernel will be mapping a file povided. Usually such file get
> >      opened by user-space helper tool which prepares 'vma_entry' structure
> >      for the kernel.
> > 
> >      PT_CKPT_CORE -- 'core_entry' structure (registers, tls, tasks specific
> >      settings). The structure is defined as a 16K container which should be
> >      enough for most cases. 8K of it is reserved for arch specific settings.
> > 
> >      PT_CKPT_PAGES -- a set of all pages which contents we should restored.
> > 
> > Apart from Elf extension flush_old_exec() has been splitted to two
> > functions -- the former flush_old_exec() and flush_exec_keep_thread().
> > The later doesn't call for de_thread() allowing to keep threads
> > relationship. Also arch_setup_additional_pages_at() helper added
> > to setup vdso at predefined address.
> > 
> > At moment only pure x86-64 architecture is supported.
> 
> I don't think this is a good idea.  We already have most of interface
> necessary for restoring task state and there's no need to put it into
> the kernel as one piece.  If there's something which can't be done
> from userland using existing interfaces, let's please discuss what
> they are and whether they can be resolved differently first.
> 

One of the problem is restoration of memory areas which are initially
provided by a kernel itself (such as vdso), the mm_struct members
such as start_code/end_code/brk and friends which are formerly set by
elf loader. Strictly speaking most of them (except brk) should not affect
process restored BUT they are exported into /proc which allows user space
to examinate them and hell knows how programs may react on such values
changed when they get restored (and I imagine a program which poll this
settings and exit out once they changed, _yes_ most programs should not
do that I believe but still... I thought if we implement c/r feature
the restored process should see himself exactly as it was before, at
least this is a final goal).

> The exec path is very intricate as it is and it would be very easy to
> introduce security or other bugs by altering its basic behavior.  exec
> presumes destruction of (most of) the current process and all its
> other threads and replacing them w/ fresh states from an executable.
> The scary part - interaction with process hierarchy and zapping of the
> current state - is handled from the core exec code.
>
> I see that you removed zapping to allow restoring multi-threaded
> process, which seems quite scary to me.  It might be okay, I don't
> know, but IMHO it just isn't a very good idea to introduce such
> variation to basic exec behavior for this rather narrow use case.
>
> In addition, leaving it alone doesn't really solve multi-threaded
> restoring, does it?  Who sets the task states for other threads?  The

The user-space tool. Eventually we are dreaming about restart with cgroups.
And the former process hierarchy is supposed to be recreated by user-space
tool as well.

Note I know that not all states are handled and most probably here some
bugs hidden in current patch as well, but it was the main reason to show
it up early, just to gather people opinions.

> current code doesn't seem to be doing anything but if you're gonna add
> it later, how are you gonna synchronize other threads?  If not, what's
> the point of pushing this chunk in when userland task state
> restoration is necessary anyway?
>
> So, at the moment, I'm rather strongly against this approach.
>

On Fri, Oct 14, 2011 at 10:10:33AM -0700, Tejun Heo wrote:
>> I see that you removed zapping to allow restoring multi-threaded
>> process, which seems quite scary to me.  It might be okay, I don't
>> know, but IMHO it just isn't a very good idea to introduce such
>> variation to basic exec behavior for this rather narrow use case.
>
>* It's still calling exec_mmap(), so the exec'ing thread will be on
>  the new mm while other threads would still be on the old one.

Hmm, seems I missed this in first place, thanks (note even having
new mm here the user-space tool will handle shared-memory allocations).

>* There are operations which assume that the calling task is
>  de-threaded and thus has exclusive access to data structures which
>  can be shared among the thread group (e.g. signal struct).  How are
>  accesses to them synchronized?

Seems it;s being not yet implemented, thanks.

>* What about properties which should be reset and then shared by
>  threads by inheriting (credentials, comm, self_exec_id and so on)?
>  e.g. new thread follows exec-time imposed rules but old ones keep
>  their original credentials.
>
>So, no, it really doesn't look good.

Anyway, thanks a HUGE for review, Tejun! I really appreciate!!

	Cyrill

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-14 17:33     ` Tejun Heo
@ 2011-10-19  9:03       ` Pavel Emelyanov
  2011-10-19 18:22         ` Tejun Heo
  0 siblings, 1 reply; 33+ messages in thread
From: Pavel Emelyanov @ 2011-10-19  9:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Cyrill Gorcunov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On 10/14/2011 09:33 PM, Tejun Heo wrote:
> Hello, again.

Hi!

Sorry for delay in response. I will answer on both of your concerns at once.

> I don't think this is a good idea.  We already have most of interface
> necessary for restoring task state and there's no need to put it into
> the kernel as one piece.  If there's something which can't be done
> from userland using existing interfaces, let's please discuss what
> they are and whether they can be resolved differently first.

The rest API we will require will look too special-purpose (like Cyrill
mentioned the ability to set the mm's fields such as start_code, etc).

Besides, putting a parasite code to restore the memory state will drive
us into certain limitations. E.g. - what address range in the target task
address space should the parasite occupy? What if we fail in finding one,
how should we act next?

> The exec path is very intricate as it is and it would be very easy to
> introduce security or other bugs by altering its basic behavior.  

Can you elaborate a bit more on this? "Other" bugs can be introduced by
any piece of code injected into the kernel, but what about security?

> exec presumes destruction of (most of) the current process and all its
> other threads and replacing them w/ fresh states from an executable.

This is *exactly* what is expected from the restoration process.

> The scary part - interaction with process hierarchy and zapping of the
> current state - is handled from the core exec code.

Right.

> I see that you removed zapping to allow restoring multi-threaded
> process, which seems quite scary to me.  It might be okay, I don't
> know, but IMHO it just isn't a very good idea to introduce such
> variation to basic exec behavior for this rather narrow use case.

Why? Can you describe your thought on this more, maybe I'm really missing
smth important.

> In addition, leaving it alone doesn't really solve multi-threaded
> restoring, does it?  

So your main concern is about multy-threaded tasks, right? I think we can
prepare the exec handler capable to show how this can look like.

> Who sets the task states for other threads? 

Which states? Pids and other IDs? Threads themselves. Memory state? The
exec handler itself. Registers? My plan is the same - the exec handler.

> The current code doesn't seem to be doing anything but if you're gonna add
> it later, how are you gonna synchronize other threads?

Synchronize what? If we're providing to a userspace a way to put things right
it's up to userspace to use it correctly. Like we saw this wit pid namespaces -
there's are plenty of ways how to kill the app with CLONE_NEWPID clone flag, but
the intention of one is not the be 100% fool-proof, but to provide an ability
to do what userspace need.

One of the abilities to restore multy-threaded task can be - clone threads from
the userspace, then wait for them to prepare themselves the way they need it,
then the threads go to sleep and the leader one calls execve. Thus the userspace
state consistency is OK.

> If not, what's the point of pushing this chunk in when userland task state
> restoration is necessary anyway?
> 
> So, at the moment, I'm rather strongly against this approach.

> On Fri, Oct 14, 2011 at 10:10:33AM -0700, Tejun Heo wrote:
>> I see that you removed zapping to allow restoring multi-threaded
>> process, which seems quite scary to me.  It might be okay, I don't
>> know, but IMHO it just isn't a very good idea to introduce such
>> variation to basic exec behavior for this rather narrow use case.
> 
> * It's still calling exec_mmap(), so the exec'ing thread will be on
>   the new mm while other threads would still be on the old one.

Why do you think it's a problem?

> * There are operations which assume that the calling task is
>   de-threaded and thus has exclusive access to data structures which
>   can be shared among the thread group (e.g. signal struct).  How are
>   accesses to them synchronized?

If we're talking about keeping the userspace state solid, then stopping the
other threads solves this.

> * What about properties which should be reset and then shared by
>   threads by inheriting (credentials, comm, self_exec_id and so on)?
>   e.g. new thread follows exec-time imposed rules but old ones keep
>   their original credentials.

I'm looking at the sys_setresuid and see that threads have independent creds
and thus they should be restored *before* calling exec.

> So, no, it really doesn't look good.

That's a pity. As I stated above, we'll try to demonstrate the exec handler
capable to restore the multy-threaded APP and send the 2nd RFC. Can you answer
my questions above so we keep your concerns in mind while doing this, please.

> Thanks.

Thank you for your feedback!

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-19  9:03       ` Pavel Emelyanov
@ 2011-10-19 18:22         ` Tejun Heo
  2011-10-19 18:49           ` Cyrill Gorcunov
                             ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Tejun Heo @ 2011-10-19 18:22 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Cyrill Gorcunov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

Hello, Pavel.

On Wed, Oct 19, 2011 at 01:03:17PM +0400, Pavel Emelyanov wrote:
> > I don't think this is a good idea.  We already have most of interface
> > necessary for restoring task state and there's no need to put it into
> > the kernel as one piece.  If there's something which can't be done
> > from userland using existing interfaces, let's please discuss what
> > they are and whether they can be resolved differently first.
> 
> The rest API we will require will look too special-purpose (like Cyrill
> mentioned the ability to set the mm's fields such as start_code, etc).

I find that quite difficult to agree with.  We're talking about some
minor additions to prctl against updating exec path to do something it
was never designed to do + new binary format.

> Besides, putting a parasite code to restore the memory state will drive
> us into certain limitations. E.g. - what address range in the target task
> address space should the parasite occupy? What if we fail in finding one,
> how should we act next?

How is that different from snapshotting time?  Unless the address is
completely packed, I can't see how that can be a problem.  In the
worst case, you can use the parasite to map what's not overlapping and
then let the ptracer restore the overlapping areas after parasite is
done.

> > The exec path is very intricate as it is and it would be very easy to
> > introduce security or other bugs by altering its basic behavior.  
> 
> Can you elaborate a bit more on this? "Other" bugs can be introduced by
> any piece of code injected into the kernel, but what about security?

exec is a pretty important security boundary.  A lot of resources (fd,
VMAs, credentails) have security-sensitive policies forced across exec
boundary and there are enough places which depends on the all
resetting property of exec (e.g. no other user of thread-group wide
resources).

> > exec presumes destruction of (most of) the current process and all its
> > other threads and replacing them w/ fresh states from an executable.
> 
> This is *exactly* what is expected from the restoration process.

Umm.... so why does the patch skip zapping?  And if you are gonna be
zapping, how are you gonna do multiple threads?

> > I see that you removed zapping to allow restoring multi-threaded
> > process, which seems quite scary to me.  It might be okay, I don't
> > know, but IMHO it just isn't a very good idea to introduce such
> > variation to basic exec behavior for this rather narrow use case.
> 
> Why? Can you describe your thought on this more, maybe I'm really missing
> smth important.

Because it breaks one of the very basic assumptions - there are no
other tasks in the thread group and thus resources usually shared
among threadgroup can be assumed to be exclusive.

> > In addition, leaving it alone doesn't really solve multi-threaded
> > restoring, does it?  
> 
> So your main concern is about multy-threaded tasks, right? I think we can
> prepare the exec handler capable to show how this can look like.

I'm strongly against that direction regardless of implementation
details.

> > The current code doesn't seem to be doing anything but if you're gonna add
> > it later, how are you gonna synchronize other threads?
> 
> Synchronize what? If we're providing to a userspace a way to put things right
> it's up to userspace to use it correctly. Like we saw this wit pid namespaces -
> there's are plenty of ways how to kill the app with CLONE_NEWPID clone flag, but
> the intention of one is not the be 100% fool-proof, but to provide an ability
> to do what userspace need.

To shared kernel data structures.

> One of the abilities to restore multy-threaded task can be - clone threads from
> the userspace, then wait for them to prepare themselves the way they need it,
> then the threads go to sleep and the leader one calls execve. Thus the userspace
> state consistency is OK.

If you're gonna let userspace do it, why bother with in-kernel thing
at all?

> > * It's still calling exec_mmap(), so the exec'ing thread will be on
> >   the new mm while other threads would still be on the old one.
> 
> Why do you think it's a problem?

Ummm.... they aren't on the same address space?  How can that be okay?
It's not only wrong from CR POV.  The kernel disallows
CLONE_THREAD/SIGHAND w/o CLONE_VM and depend on that assumption,
you're breaking that.

> > * There are operations which assume that the calling task is
> >   de-threaded and thus has exclusive access to data structures which
> >   can be shared among the thread group (e.g. signal struct).  How are
> >   accesses to them synchronized?
> 
> If we're talking about keeping the userspace state solid, then stopping the
> other threads solves this.

Again, there are kernel exclusivity assumptions.

> That's a pity. As I stated above, we'll try to demonstrate the exec handler
> capable to restore the multy-threaded APP and send the 2nd RFC. Can you answer
> my questions above so we keep your concerns in mind while doing this, please.

IMHO, this is a fundamentally broken approach which isn't even
necessary.  So, FWIW, NACK.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-19 18:22         ` Tejun Heo
@ 2011-10-19 18:49           ` Cyrill Gorcunov
  2011-10-19 18:52             ` Cyrill Gorcunov
  2011-10-19 19:56           ` Cyrill Gorcunov
  2011-10-20  8:33           ` Pavel Emelyanov
  2 siblings, 1 reply; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-19 18:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On Wed, Oct 19, 2011 at 11:22:28AM -0700, Tejun Heo wrote:
> Hello, Pavel.
> 
> On Wed, Oct 19, 2011 at 01:03:17PM +0400, Pavel Emelyanov wrote:
> > > I don't think this is a good idea.  We already have most of interface
> > > necessary for restoring task state and there's no need to put it into
> > > the kernel as one piece.  If there's something which can't be done
> > > from userland using existing interfaces, let's please discuss what
> > > they are and whether they can be resolved differently first.
> > 
> > The rest API we will require will look too special-purpose (like Cyrill
> > mentioned the ability to set the mm's fields such as start_code, etc).
> 
> I find that quite difficult to agree with.  We're talking about some
> minor additions to prctl against updating exec path to do something it
> was never designed to do + new binary format.
> 

Hi Tejun,

apart from controversy on elf'ish restore, I would like to know if there
a way to poke alien process memory with something faster than sizeof(long)
at once as ptrace proposes. At moment on my test tasks it's not that long
but the problem is there are servers with GBs of memory and snapshotting
memory data becomes a bottleneck. The same applies to restore procedure,
especially on big-data-chunks. Hm? Am I missing something obvious?

	Cyrill

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-19 18:49           ` Cyrill Gorcunov
@ 2011-10-19 18:52             ` Cyrill Gorcunov
  2011-10-19 18:53               ` Tejun Heo
  0 siblings, 1 reply; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-19 18:52 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On Wed, Oct 19, 2011 at 10:49:26PM +0400, Cyrill Gorcunov wrote:
..
> 
> Hi Tejun,
> 
> apart from controversy on elf'ish restore, I would like to know if there
> a way to poke alien process memory with something faster than sizeof(long)
> at once as ptrace proposes. At moment on my test tasks it's not that long
> but the problem is there are servers with GBs of memory and snapshotting
> memory data becomes a bottleneck. The same applies to restore procedure,
> especially on big-data-chunks. Hm? Am I missing something obvious?
> 

Drop it, I had something different in mind, sorry for noise.

	Cyrill

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-19 18:52             ` Cyrill Gorcunov
@ 2011-10-19 18:53               ` Tejun Heo
  0 siblings, 0 replies; 33+ messages in thread
From: Tejun Heo @ 2011-10-19 18:53 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Emelyanov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

Hello,

On Wed, Oct 19, 2011 at 10:52:10PM +0400, Cyrill Gorcunov wrote:
> On Wed, Oct 19, 2011 at 10:49:26PM +0400, Cyrill Gorcunov wrote:
> ..
> > 
> > Hi Tejun,
> > 
> > apart from controversy on elf'ish restore, I would like to know if there
> > a way to poke alien process memory with something faster than sizeof(long)
> > at once as ptrace proposes. At moment on my test tasks it's not that long
> > but the problem is there are servers with GBs of memory and snapshotting
> > memory data becomes a bottleneck. The same applies to restore procedure,
> > especially on big-data-chunks. Hm? Am I missing something obvious?
> > 
> 
> Drop it, I had something different in mind, sorry for noise.

Heh, just in case.  Before resuming the restored process, the restorer
has full control.  Letting the target process reading and mmapping
itself should be enough, I think.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-19 18:22         ` Tejun Heo
  2011-10-19 18:49           ` Cyrill Gorcunov
@ 2011-10-19 19:56           ` Cyrill Gorcunov
  2011-10-21 18:26             ` Tejun Heo
  2011-10-20  8:33           ` Pavel Emelyanov
  2 siblings, 1 reply; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-19 19:56 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On Wed, Oct 19, 2011 at 11:22:28AM -0700, Tejun Heo wrote:
> Hello, Pavel.
> 
> On Wed, Oct 19, 2011 at 01:03:17PM +0400, Pavel Emelyanov wrote:
> > > I don't think this is a good idea.  We already have most of interface
> > > necessary for restoring task state and there's no need to put it into
> > > the kernel as one piece.  If there's something which can't be done
> > > from userland using existing interfaces, let's please discuss what
> > > they are and whether they can be resolved differently first.
> > 
> > The rest API we will require will look too special-purpose (like Cyrill
> > mentioned the ability to set the mm's fields such as start_code, etc).
> 
> I find that quite difficult to agree with.  We're talking about some
> minor additions to prctl against updating exec path to do something it
> was never designed to do + new binary format.

Tejun, regardless the mm_struct entries, what about mapping vdso pages at the
place they had on snapshot time? How would we handle them?

	Cyrill

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-19 18:22         ` Tejun Heo
  2011-10-19 18:49           ` Cyrill Gorcunov
  2011-10-19 19:56           ` Cyrill Gorcunov
@ 2011-10-20  8:33           ` Pavel Emelyanov
  2011-10-20 15:56             ` Tejun Heo
  2 siblings, 1 reply; 33+ messages in thread
From: Pavel Emelyanov @ 2011-10-20  8:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Cyrill Gorcunov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On 10/19/2011 10:22 PM, Tejun Heo wrote:
> Hello, Pavel.
> 
> On Wed, Oct 19, 2011 at 01:03:17PM +0400, Pavel Emelyanov wrote:
>>> I don't think this is a good idea.  We already have most of interface
>>> necessary for restoring task state and there's no need to put it into
>>> the kernel as one piece.  If there's something which can't be done
>>> from userland using existing interfaces, let's please discuss what
>>> they are and whether they can be resolved differently first.
>>
>> The rest API we will require will look too special-purpose (like Cyrill
>> mentioned the ability to set the mm's fields such as start_code, etc).
> 
> I find that quite difficult to agree with.  We're talking about some
> minor additions to prctl against updating exec path to do something it
> was never designed to do + new binary format.

No, we're talking about a bit more, at least:
* A bunch of fields on mm_struct
  ** mmap layout
  ** stack/code/brk addresses
  ** dumpable bit
  ** etc...
* The /proc/self/exe link
* Proper binfmt pointer on mm_struct
* VDSO (as Cyrill mentioned)

And creating a dedicated API for each of them looks ... strange.

Let's do the thing - we'll try to implement the restore without the exec and will
look at how many additional knobs we'll have to put into the kernel-to-user API, OK?

Besides, this will let us to compare the performance of both...

>> Besides, putting a parasite code to restore the memory state will drive
>> us into certain limitations. E.g. - what address range in the target task
>> address space should the parasite occupy? What if we fail in finding one,
>> how should we act next?
> 
> How is that different from snapshotting time?  

No difference, sure.

> Unless the address is completely packed, I can't see how that can be a problem.  
> In the worst case, you can use the parasite to map what's not overlapping and
> then let the ptracer restore the overlapping areas after parasite is
> done.

Yes, of course, taking into account the parasite code is fairly small to fit
the existing holes in address space. Otherwise poking the overlapping places
byte-by-byte will kill the performance.

Do we believe it will always be true?

>>> The exec path is very intricate as it is and it would be very easy to
>>> introduce security or other bugs by altering its basic behavior.  
>>
>> Can you elaborate a bit more on this? "Other" bugs can be introduced by
>> any piece of code injected into the kernel, but what about security?
> 
> exec is a pretty important security boundary.  A lot of resources (fd,
> VMAs, credentails) have security-sensitive policies forced across exec
> boundary and there are enough places which depends on the all
> resetting property of exec (e.g. no other user of thread-group wide
> resources).

:\ This explanation looks strange. Why then this zapping is performed in binary
handlers, rather than in generic exec code? Methinks this assumption can be
weakened if binary handler thinks it can handle it...

>>> exec presumes destruction of (most of) the current process and all its
>>> other threads and replacing them w/ fresh states from an executable.
>>
>> This is *exactly* what is expected from the restoration process.
> 
> Umm.... so why does the patch skip zapping?  And if you are gonna be
> zapping, how are you gonna do multiple threads?

It skips zapping to let the threads survive the exec.

>>> I see that you removed zapping to allow restoring multi-threaded
>>> process, which seems quite scary to me.  It might be okay, I don't
>>> know, but IMHO it just isn't a very good idea to introduce such
>>> variation to basic exec behavior for this rather narrow use case.
>>
>> Why? Can you describe your thought on this more, maybe I'm really missing
>> smth important.
> 
> Because it breaks one of the very basic assumptions - there are no
> other tasks in the thread group and thus resources usually shared
> among threadgroup can be assumed to be exclusive.

And are we doomed to this assumption due to the security reasons described
above?

>>> In addition, leaving it alone doesn't really solve multi-threaded
>>> restoring, does it?  
>>
>> So your main concern is about multy-threaded tasks, right? I think we can
>> prepare the exec handler capable to show how this can look like.
> 
> I'm strongly against that direction regardless of implementation
> details.
> 
>>> The current code doesn't seem to be doing anything but if you're gonna add
>>> it later, how are you gonna synchronize other threads?
>>
>> Synchronize what? If we're providing to a userspace a way to put things right
>> it's up to userspace to use it correctly. Like we saw this wit pid namespaces -
>> there's are plenty of ways how to kill the app with CLONE_NEWPID clone flag, but
>> the intention of one is not the be 100% fool-proof, but to provide an ability
>> to do what userspace need.
> 
> To shared kernel data structures.

Yet again - keeping shared stuctres self-consistent from which point of view? From
the kernel's one - already handled with in-kernel locks. From the userspace one -
every single kernel API provides a *way* to do things right, but doesn't provide
the fool protection.

>> One of the abilities to restore multy-threaded task can be - clone threads from
>> the userspace, then wait for them to prepare themselves the way they need it,
>> then the threads go to sleep and the leader one calls execve. Thus the userspace
>> state consistency is OK.
> 
> If you're gonna let userspace do it, why bother with in-kernel thing
> at all?

Because userspace cannot just flush itself memory and registers and switch to new
ones. Someone's help is required anyway. Did I misunderstood your question?

>>> * It's still calling exec_mmap(), so the exec'ing thread will be on
>>>   the new mm while other threads would still be on the old one.
>>
>> Why do you think it's a problem?
> 
> Ummm.... they aren't on the same address space?  How can that be okay?

The similar (but mirrored) thing with vfork - two tasks are on the same address space,
but the child is supposed not to kill the parent's one. And this is OK for most of the
apps using it.

> It's not only wrong from CR POV.  The kernel disallows
> CLONE_THREAD/SIGHAND w/o CLONE_VM and depend on that assumption,
> you're breaking that.

OK, if this is the only problem we can keep the mm_struct itself just unmapping the
vmas with unmap(0, 0xff...f) and repopulating one.

>>> * There are operations which assume that the calling task is
>>>   de-threaded and thus has exclusive access to data structures which
>>>   can be shared among the thread group (e.g. signal struct).  How are
>>>   accesses to them synchronized?
>>
>> If we're talking about keeping the userspace state solid, then stopping the
>> other threads solves this.
> 
> Again, there are kernel exclusivity assumptions.
> 
>> That's a pity. As I stated above, we'll try to demonstrate the exec handler
>> capable to restore the multy-threaded APP and send the 2nd RFC. Can you answer
>> my questions above so we keep your concerns in mind while doing this, please.
> 
> IMHO, this is a fundamentally broken approach which isn't even
> necessary.  So, FWIW, NACK.

It's a pity :( Anyway, as I stated above, we'll try to compare two real implementations,
not abstract assumptions.

> Thank you.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-20  8:33           ` Pavel Emelyanov
@ 2011-10-20 15:56             ` Tejun Heo
  2011-10-20 16:04               ` Cyrill Gorcunov
  2011-10-20 17:30               ` Pavel Emelyanov
  0 siblings, 2 replies; 33+ messages in thread
From: Tejun Heo @ 2011-10-20 15:56 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Cyrill Gorcunov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

Hello,

On Thu, Oct 20, 2011 at 12:33:51PM +0400, Pavel Emelyanov wrote:
> > To shared kernel data structures.
> 
> Yet again - keeping shared stuctres self-consistent from which point of view? From
> the kernel's one - already handled with in-kernel locks. From the userspace one -
> every single kernel API provides a *way* to do things right, but doesn't provide
> the fool protection.

Sigh, no, in exec and related paths, the fact that all other threads
have been zapped are depended upon and the exec'ing task assumes
exclusive access to resources which in usual cases would require other
forms of exclusion.  Maybe there are not too many of them and things
can definitely be audited and updated but it is pretty intricate down
there and I just don't see good enough justification here.

> >> One of the abilities to restore multy-threaded task can be - clone threads from
> >> the userspace, then wait for them to prepare themselves the way they need it,
> >> then the threads go to sleep and the leader one calls execve. Thus the userspace
> >> state consistency is OK.
> > 
> > If you're gonna let userspace do it, why bother with in-kernel thing
> > at all?
> 
> Because userspace cannot just flush itself memory and registers and switch to new
> ones. Someone's help is required anyway. Did I misunderstood your question?

ptrace?

> >>> * It's still calling exec_mmap(), so the exec'ing thread will be on
> >>>   the new mm while other threads would still be on the old one.
> >>
> >> Why do you think it's a problem?
> > 
> > Ummm.... they aren't on the same address space?  How can that be okay?
> 
> The similar (but mirrored) thing with vfork - two tasks are on the same address space,
> but the child is supposed not to kill the parent's one. And this is OK for most of the
> apps using it.

I really don't know what to say here.  This is a clear bug.  You
should be switching mm's of other threads too if you want to go this
way.  vfork has have nothing to do with this.

> > It's not only wrong from CR POV.  The kernel disallows
> > CLONE_THREAD/SIGHAND w/o CLONE_VM and depend on that assumption,
> > you're breaking that.
> 
> OK, if this is the only problem we can keep the mm_struct itself just unmapping the
> vmas with unmap(0, 0xff...f) and repopulating one.

Yeah, and change another basic assumption without proper audit or
consideration for a feature which is borderline unnecessary?

> > IMHO, this is a fundamentally broken approach which isn't even
> > necessary.  So, FWIW, NACK.
> 
> It's a pity :( Anyway, as I stated above, we'll try to compare two
> real implementations, not abstract assumptions.

Given the state of this thread, I'm pretty skeptical this one can
survive but, hey, who knows?

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-20 15:56             ` Tejun Heo
@ 2011-10-20 16:04               ` Cyrill Gorcunov
  2011-10-20 17:30               ` Pavel Emelyanov
  1 sibling, 0 replies; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-20 16:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On Thu, Oct 20, 2011 at 08:56:45AM -0700, Tejun Heo wrote:
...
> 
> > > IMHO, this is a fundamentally broken approach which isn't even
> > > necessary.  So, FWIW, NACK.
> > 
> > It's a pity :( Anyway, as I stated above, we'll try to compare two
> > real implementations, not abstract assumptions.
> 
> Given the state of this thread, I'm pretty skeptical this one can
> survive but, hey, who knows?
> 

Nope, if it eventually appear to be better -- there will be no point
to continue keeping elfism, and back, if it end up in bunch of levers
are to be injected into kernel, then definitely tuning up elf more
will be a winner. It's development process after all :) I'm working
on seized-restore atm, once I have something to show -- will ping you.

	Cyrill

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-20 15:56             ` Tejun Heo
  2011-10-20 16:04               ` Cyrill Gorcunov
@ 2011-10-20 17:30               ` Pavel Emelyanov
  1 sibling, 0 replies; 33+ messages in thread
From: Pavel Emelyanov @ 2011-10-20 17:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Cyrill Gorcunov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On 10/20/2011 07:56 PM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Oct 20, 2011 at 12:33:51PM +0400, Pavel Emelyanov wrote:
>>> To shared kernel data structures.
>>
>> Yet again - keeping shared stuctres self-consistent from which point of view? From
>> the kernel's one - already handled with in-kernel locks. From the userspace one -
>> every single kernel API provides a *way* to do things right, but doesn't provide
>> the fool protection.
> 
> Sigh, no, in exec and related paths, the fact that all other threads
> have been zapped are depended upon and the exec'ing task assumes
> exclusive access to resources which in usual cases would require other
> forms of exclusion.  Maybe there are not too many of them and things
> can definitely be audited and updated but it is pretty intricate down
> there and I just don't see good enough justification here.

The amount of new kernel API is good justification. I've sent the list of what
should we allow to change from user-space for the sake of restore only.

Anyway - let's wait for Cyrill's work done in this area.

>>> IMHO, this is a fundamentally broken approach which isn't even
>>> necessary.  So, FWIW, NACK.
>>
>> It's a pity :( Anyway, as I stated above, we'll try to compare two
>> real implementations, not abstract assumptions.
> 
> Given the state of this thread, I'm pretty skeptical this one can
> survive but, hey, who knows?

That's good to hear that you're still open for discussion.

> Thank you.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-14 17:10   ` Tejun Heo
  2011-10-14 17:33     ` Tejun Heo
  2011-10-15 18:59     ` Cyrill Gorcunov
@ 2011-10-21 11:06     ` Glauber Costa
  2011-10-21 11:20       ` Cyrill Gorcunov
  2011-10-22 16:49     ` Dan Merillat
  3 siblings, 1 reply; 33+ messages in thread
From: Glauber Costa @ 2011-10-21 11:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Cyrill Gorcunov, linux-kernel, Andrew Vagin, Pavel Emelyanov,
	James Bottomley, H. Peter Anvin, Ingo Molnar, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On 10/14/2011 09:10 PM, Tejun Heo wrote:
> Hello,
>
> (cc'ing Oleg and Linus, and quoting whole body)
>
> On Fri, Oct 14, 2011 at 03:04:21PM +0400, Cyrill Gorcunov wrote:
>> This patch add ability to run that named "checkpoint" files by
>> enhancing Elf file format, which includes
>>
>>   - new Elf file type ET_CKPT
>>
>>   - three additional program header types PT_CKPT_VMA, PT_CKPT_CORE
>>     and PT_CKPT_PAGES.
>>
>>       PT_CKPT_VMA -- holds 'vma_entry' structure, which describes the
>>       memory area the kernel should map. It also might contain a file descriptor
>>       so the kernel will be mapping a file povided. Usually such file get
>>       opened by user-space helper tool which prepares 'vma_entry' structure
>>       for the kernel.
>>
>>       PT_CKPT_CORE -- 'core_entry' structure (registers, tls, tasks specific
>>       settings). The structure is defined as a 16K container which should be
>>       enough for most cases. 8K of it is reserved for arch specific settings.
>>
>>       PT_CKPT_PAGES -- a set of all pages which contents we should restored.
>>
>> Apart from Elf extension flush_old_exec() has been splitted to two
>> functions -- the former flush_old_exec() and flush_exec_keep_thread().
>> The later doesn't call for de_thread() allowing to keep threads
>> relationship. Also arch_setup_additional_pages_at() helper added
>> to setup vdso at predefined address.
>>
>> At moment only pure x86-64 architecture is supported.
>
> I don't think this is a good idea.  We already have most of interface
> necessary for restoring task state and there's no need to put it into
> the kernel as one piece.  If there's something which can't be done
> from userland using existing interfaces, let's please discuss what
> they are and whether they can be resolved differently first.
>
> The exec path is very intricate as it is and it would be very easy to
> introduce security or other bugs by altering its basic behavior.  exec
> presumes destruction of (most of) the current process and all its
> other threads and replacing them w/ fresh states from an executable.
> The scary part - interaction with process hierarchy and zapping of the
> current state - is handled from the core exec code.
>
> I see that you removed zapping to allow restoring multi-threaded
> process, which seems quite scary to me.  It might be okay, I don't
> know, but IMHO it just isn't a very good idea to introduce such
> variation to basic exec behavior for this rather narrow use case.
Agreed.

exec() is a fundamental interface to the kernel, and the change proposed 
here is too disruptive. Not only that, it is rather unannounced: since 
not always one knows kind of fmt file is being exec'd, it gets hard to 
infer which behavior to expect.

I am wondering, though: if exec is a problem, but the binary handler is 
not, maybe we can exec a process using this handler, and then have the 
handler itself to create the thread hierarchy. This way we avoid 
changing exec() behavior at all, yet achieving the same results.

What do you think?



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-21 11:06     ` Glauber Costa
@ 2011-10-21 11:20       ` Cyrill Gorcunov
  2011-10-21 11:21         ` Glauber Costa
  0 siblings, 1 reply; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-21 11:20 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Tejun Heo, linux-kernel, Andrew Vagin, Pavel Emelyanov,
	James Bottomley, H. Peter Anvin, Ingo Molnar, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On Fri, Oct 21, 2011 at 03:06:12PM +0400, Glauber Costa wrote:
...
> exec() is a fundamental interface to the kernel, and the change
> proposed here is too disruptive. Not only that, it is rather
> unannounced: since not always one knows kind of fmt file is being
> exec'd, it gets hard to infer which behavior to expect.
> 

This missed snipped in changelog indeed my very fault, sorry for that.

> I am wondering, though: if exec is a problem, but the binary handler
> is not, maybe we can exec a process using this handler, and then
> have the handler itself to create the thread hierarchy. This way we
> avoid changing exec() behavior at all, yet achieving the same
> results.
> 
> What do you think?
> 
> 

Glauber, could you please elaborate, you mean to call for forks inside
elf-chkpt handler, right? Or you mean something else?

	Cyrill

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-21 11:20       ` Cyrill Gorcunov
@ 2011-10-21 11:21         ` Glauber Costa
  2011-10-21 11:35           ` Cyrill Gorcunov
  0 siblings, 1 reply; 33+ messages in thread
From: Glauber Costa @ 2011-10-21 11:21 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Tejun Heo, linux-kernel, Andrew Vagin, Pavel Emelyanov,
	James Bottomley, H. Peter Anvin, Ingo Molnar, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On 10/21/2011 03:20 PM, Cyrill Gorcunov wrote:
> On Fri, Oct 21, 2011 at 03:06:12PM +0400, Glauber Costa wrote:
> ...
>> exec() is a fundamental interface to the kernel, and the change
>> proposed here is too disruptive. Not only that, it is rather
>> unannounced: since not always one knows kind of fmt file is being
>> exec'd, it gets hard to infer which behavior to expect.
>>
>
> This missed snipped in changelog indeed my very fault, sorry for that.
>
>> I am wondering, though: if exec is a problem, but the binary handler
>> is not, maybe we can exec a process using this handler, and then
>> have the handler itself to create the thread hierarchy. This way we
>> avoid changing exec() behavior at all, yet achieving the same
>> results.
>>
>> What do you think?
>>
>>
>
> Glauber, could you please elaborate, you mean to call for forks inside
> elf-chkpt handler, right? Or you mean something else?

not fork(), clone().


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-21 11:21         ` Glauber Costa
@ 2011-10-21 11:35           ` Cyrill Gorcunov
  0 siblings, 0 replies; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-21 11:35 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Tejun Heo, linux-kernel, Andrew Vagin, Pavel Emelyanov,
	James Bottomley, H. Peter Anvin, Ingo Molnar, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On Fri, Oct 21, 2011 at 03:21:12PM +0400, Glauber Costa wrote:
...
> >>
> >
> >Glauber, could you please elaborate, you mean to call for forks inside
> >elf-chkpt handler, right? Or you mean something else?
> 
> not fork(), clone().
> 

I meant do_fork call. Anyway need to think about such approach, thanks!

	Cyrill

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-19 19:56           ` Cyrill Gorcunov
@ 2011-10-21 18:26             ` Tejun Heo
  2011-10-21 18:36               ` Cyrill Gorcunov
  0 siblings, 1 reply; 33+ messages in thread
From: Tejun Heo @ 2011-10-21 18:26 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Emelyanov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

Hello,

On Wed, Oct 19, 2011 at 11:56:32PM +0400, Cyrill Gorcunov wrote:
> > I find that quite difficult to agree with.  We're talking about some
> > minor additions to prctl against updating exec path to do something it
> > was never designed to do + new binary format.
> 
> Tejun, regardless the mm_struct entries, what about mapping vdso pages at the
> place they had on snapshot time? How would we handle them?

We discussed this elsewhere but just for the record.

I haven't really looked into vdso but what the currently suggested
implementation isn't correct.  vdso is there so that kernel can
provide userspace code which may vary depending on kernel version and
the actual machine it's running on, so both saving vdso and restoring
verbatim and asking the kernel to map its vdso at certain address are
wrong.

The checkpointer would need to remember the vdso symtab from the
original process and then somehow build indirect jump table at the
same address which redirects to the vdso of the machine the target
process is being restored on.  I'm not sure about the details of the
implementation but it could be something like "please allow me to
write to this original vdso address and keep your vdso elsewhere".

I think this comes back to why one-stop-solution-in-kernel is a bad
idea for CR.  My impression is that when people say "that would
require too many small API updatesa", it usually means that they
haven't really thought about each necessary piece enough and just
hacked something up lumpy and fuzzy on the edges.  The thing is that
no matter how you lump them, those problems don't go away and
attacking things in small understandable API pieces not only ensures
the update itself makes sense and may be useful for others too but
also forces people to actually think about each problem.

Thank you.

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-21 18:26             ` Tejun Heo
@ 2011-10-21 18:36               ` Cyrill Gorcunov
  2011-10-21 18:42                 ` Cyrill Gorcunov
  0 siblings, 1 reply; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-21 18:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On Fri, Oct 21, 2011 at 11:26:10AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Oct 19, 2011 at 11:56:32PM +0400, Cyrill Gorcunov wrote:
> > > I find that quite difficult to agree with.  We're talking about some
> > > minor additions to prctl against updating exec path to do something it
> > > was never designed to do + new binary format.
> > 
> > Tejun, regardless the mm_struct entries, what about mapping vdso pages at the
> > place they had on snapshot time? How would we handle them?
> 
> We discussed this elsewhere but just for the record.
> 

Yes ;)

> I haven't really looked into vdso but what the currently suggested
> implementation isn't correct.  vdso is there so that kernel can
> provide userspace code which may vary depending on kernel version and
> the actual machine it's running on, so both saving vdso and restoring
> verbatim and asking the kernel to map its vdso at certain address are
> wrong.
> 

It's fine for first iteration (or rather for first RFC where such details
exactly to be revealed and discussed).

> The checkpointer would need to remember the vdso symtab from the
> original process and then somehow build indirect jump table at the
> same address which redirects to the vdso of the machine the target
> process is being restored on.  I'm not sure about the details of the
> implementation but it could be something like "please allow me to
> write to this original vdso address and keep your vdso elsewhere".
> 

Original vdso address should remain as it was at checkpoint time but
you're right that we need kinda bridge between various kernels versions
I think.

> I think this comes back to why one-stop-solution-in-kernel is a bad
> idea for CR.  My impression is that when people say "that would
> require too many small API updatesa", it usually means that they
> haven't really thought about each necessary piece enough and just
> hacked something up lumpy and fuzzy on the edges.  The thing is that
> no matter how you lump them, those problems don't go away and
> attacking things in small understandable API pieces not only ensures
> the update itself makes sense and may be useful for others too but
> also forces people to actually think about each problem.
> 

	Cyrill

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-21 18:36               ` Cyrill Gorcunov
@ 2011-10-21 18:42                 ` Cyrill Gorcunov
  2011-10-21 18:48                   ` Tejun Heo
  0 siblings, 1 reply; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-21 18:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On Fri, Oct 21, 2011 at 10:36:28PM +0400, Cyrill Gorcunov wrote:
...
> 
> > I think this comes back to why one-stop-solution-in-kernel is a bad
> > idea for CR.  My impression is that when people say "that would
> > require too many small API updatesa", it usually means that they
> > haven't really thought about each necessary piece enough and just

Don't claim such things, Tejun. Yes, I'm missing a number of things
(the same applies to you as well), but this patch was not 'just a
fast way to achieve what we need'. It's rather an attempt to gather
all things I've missed and provide more robust/solid version at the
next iteration (regargless which form this patch appeared, complete
user-space solution, some kernel api extension of whatever).

> > hacked something up lumpy and fuzzy on the edges.  The thing is that
> > no matter how you lump them, those problems don't go away and
> > attacking things in small understandable API pieces not only ensures
> > the update itself makes sense and may be useful for others too but
> > also forces people to actually think about each problem.
> > 
> 

	Cyrill

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-21 18:42                 ` Cyrill Gorcunov
@ 2011-10-21 18:48                   ` Tejun Heo
  2011-10-21 18:53                     ` Cyrill Gorcunov
  2011-10-22  6:34                     ` Pavel Emelyanov
  0 siblings, 2 replies; 33+ messages in thread
From: Tejun Heo @ 2011-10-21 18:48 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Emelyanov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

Hello,

On Fri, Oct 21, 2011 at 10:42:39PM +0400, Cyrill Gorcunov wrote:
> On Fri, Oct 21, 2011 at 10:36:28PM +0400, Cyrill Gorcunov wrote:
> > > I think this comes back to why one-stop-solution-in-kernel is a bad
> > > idea for CR.  My impression is that when people say "that would
> > > require too many small API updatesa", it usually means that they
> > > haven't really thought about each necessary piece enough and just
> 
> Don't claim such things, Tejun. Yes, I'm missing a number of things
> (the same applies to you as well), but this patch was not 'just a
> fast way to achieve what we need'. It's rather an attempt to gather
> all things I've missed and provide more robust/solid version at the
> next iteration (regargless which form this patch appeared, complete
> user-space solution, some kernel api extension of whatever).

Hmmm... my point was that single lumpy implementation allows for such
mistakes to go unnoticed && makes it difficult to recover and thus is
a bad approach to the problem.  For example, unlikely but, let's say
we missed this vdso thing and it went into upstream, now what would we
do?  We'll need to somehow allow userland supply vdso symtab to kernel
and let the kernel do the bridging and then probably bump the version
number for elf format and so on.  IMHO, that just isn't a good way to
approach the problem, so let's please do it piece by piece.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-21 18:48                   ` Tejun Heo
@ 2011-10-21 18:53                     ` Cyrill Gorcunov
  2011-10-22  6:34                     ` Pavel Emelyanov
  1 sibling, 0 replies; 33+ messages in thread
From: Cyrill Gorcunov @ 2011-10-21 18:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pavel Emelyanov, linux-kernel@vger.kernel.org, Andrey Vagin,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On Fri, Oct 21, 2011 at 11:48:17AM -0700, Tejun Heo wrote:
...
> > 
> > Don't claim such things, Tejun. Yes, I'm missing a number of things
> > (the same applies to you as well), but this patch was not 'just a
> > fast way to achieve what we need'. It's rather an attempt to gather
> > all things I've missed and provide more robust/solid version at the
> > next iteration (regargless which form this patch appeared, complete
> > user-space solution, some kernel api extension of whatever).
> 
> Hmmm... my point was that single lumpy implementation allows for such
> mistakes to go unnoticed && makes it difficult to recover and thus is

Agreed, big code chunks are hard to even simply review, not talking about
notice of some deeper mistakes (espec in code logic and so on).

> a bad approach to the problem.  For example, unlikely but, let's say
> we missed this vdso thing and it went into upstream, now what would we
> do?  We'll need to somehow allow userland supply vdso symtab to kernel
> and let the kernel do the bridging and then probably bump the version
> number for elf format and so on.  IMHO, that just isn't a good way to
> approach the problem, so let's please do it piece by piece.
> 

Yes, I'll reduce code next time, for sure, thanks!

	Cyrill

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-21 18:48                   ` Tejun Heo
  2011-10-21 18:53                     ` Cyrill Gorcunov
@ 2011-10-22  6:34                     ` Pavel Emelyanov
  1 sibling, 0 replies; 33+ messages in thread
From: Pavel Emelyanov @ 2011-10-22  6:34 UTC (permalink / raw)
  To: Tejun Heo, Cyrill Gorcunov
  Cc: linux-kernel@vger.kernel.org, Andrey Vagin, James Bottomley,
	Glauber Costa, H. Peter Anvin, Ingo Molnar, Dave Hansen,
	Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On 10/21/2011 10:48 PM, Tejun Heo wrote:
> Hello,
> 
> On Fri, Oct 21, 2011 at 10:42:39PM +0400, Cyrill Gorcunov wrote:
>> On Fri, Oct 21, 2011 at 10:36:28PM +0400, Cyrill Gorcunov wrote:
>>>> I think this comes back to why one-stop-solution-in-kernel is a bad
>>>> idea for CR.  My impression is that when people say "that would
>>>> require too many small API updatesa", it usually means that they
>>>> haven't really thought about each necessary piece enough and just
>>
>> Don't claim such things, Tejun. Yes, I'm missing a number of things
>> (the same applies to you as well), but this patch was not 'just a
>> fast way to achieve what we need'. It's rather an attempt to gather
>> all things I've missed and provide more robust/solid version at the
>> next iteration (regargless which form this patch appeared, complete
>> user-space solution, some kernel api extension of whatever).
> 
> Hmmm... my point was that single lumpy implementation allows for such
> mistakes to go unnoticed && makes it difficult to recover and thus is
> a bad approach to the problem.  For example, unlikely but, let's say
> we missed this vdso thing and it went into upstream, now what would we
> do?  We'll need to somehow allow userland supply vdso symtab to kernel
> and let the kernel do the bridging and then probably bump the version
> number for elf format and so on.  IMHO, that just isn't a good way to
> approach the problem, so let's please do it piece by piece.

What mistakes are we talking about?

With the seized restore we prepare the VDSO thing in the userspace and
then push this in to the kernel (let alone the question how we make the
kernel know this page IS VDSO), with the exec-ed restore we also prepare 
this VDSO in the userspace and then push this into the kernel as well, but
via the binary handler. This handler is extremely simple thing and not
supposed to do complex logic - just call do_mmap_pgoff() and write pages 
in there in cycle, which is generic enough even for VDSO, no?

What am I missing here, where is the difference? Don't get me wrong, I'm
not mocking, just assume that we can all miss smth from each other.

> Thanks.
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [patch 5/5] elf: Add support for loading ET_CKPT files
  2011-10-14 17:10   ` Tejun Heo
                       ` (2 preceding siblings ...)
  2011-10-21 11:06     ` Glauber Costa
@ 2011-10-22 16:49     ` Dan Merillat
  3 siblings, 0 replies; 33+ messages in thread
From: Dan Merillat @ 2011-10-22 16:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Cyrill Gorcunov, linux-kernel, Andrew Vagin, Pavel Emelyanov,
	James Bottomley, Glauber Costa, H. Peter Anvin, Ingo Molnar,
	Dave Hansen, Eric W. Biederman, Daniel Lezcano, Alexey Dobriyan,
	Linus Torvalds, Oleg Nesterov

On Fri, Oct 14, 2011 at 1:10 PM, Tejun Heo <tj@kernel.org> wrote:
> I don't think this is a good idea.  We already have most of interface
> necessary for restoring task state and there's no need to put it into
> the kernel as one piece.  If there's something which can't be done
> from userland using existing interfaces, let's please discuss what
> they are and whether they can be resolved differently first.

I may be missing something, but programs do care about their and their
children's PIDs, and even with a kernel interface to pick a specific
PID for a new process (yikes!) it wouldn't work if the original PID
was in use by another process.

Open file handles that point to files that have been changed/deleted
while the process was frozen, unlinked scratch files that are gone for
good when the original process dies, shared memory, processes at the
other end of pipes, etc.  Graphical programs get even worse, with
trying to reproduce the X11 state or GPU state.   Network isn't as
bad, as it's expected that it could be dropped at any time so most
programs can handle it.

If the userspace to be checkpointed can cope with all that, it can
handle moving the vdso mapping.   Since it has to have specific logic
to handle all the different types of , why not just have it dump
internal state on a signal and rerun with that file as an argument?
The target process that can deal with being externally frozen would
seem to be basically a tech demo.

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2011-10-22 16:49 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-10-14 11:04 [patch 0/5] [RFC] Checkpoint/restore and Elf extension Cyrill Gorcunov
2011-10-14 11:04 ` [patch 1/5] proc: Introduce the Children: line in /proc/<pid>/status Cyrill Gorcunov
2011-10-14 16:36   ` Tejun Heo
2011-10-14 11:04 ` [patch 2/5] fs: Add do_close helper Cyrill Gorcunov
2011-10-14 11:04 ` [patch 3/5] fs, proc: Add /proc/$pid/tls entry Cyrill Gorcunov
2011-10-14 16:40   ` Tejun Heo
2011-10-14 16:43     ` Cyrill Gorcunov
2011-10-14 11:04 ` [patch 4/5] fs, proc: Add start_data, end_data, start_brk members to /proc/$pid/stat Cyrill Gorcunov
2011-10-14 11:04 ` [patch 5/5] elf: Add support for loading ET_CKPT files Cyrill Gorcunov
2011-10-14 17:10   ` Tejun Heo
2011-10-14 17:33     ` Tejun Heo
2011-10-19  9:03       ` Pavel Emelyanov
2011-10-19 18:22         ` Tejun Heo
2011-10-19 18:49           ` Cyrill Gorcunov
2011-10-19 18:52             ` Cyrill Gorcunov
2011-10-19 18:53               ` Tejun Heo
2011-10-19 19:56           ` Cyrill Gorcunov
2011-10-21 18:26             ` Tejun Heo
2011-10-21 18:36               ` Cyrill Gorcunov
2011-10-21 18:42                 ` Cyrill Gorcunov
2011-10-21 18:48                   ` Tejun Heo
2011-10-21 18:53                     ` Cyrill Gorcunov
2011-10-22  6:34                     ` Pavel Emelyanov
2011-10-20  8:33           ` Pavel Emelyanov
2011-10-20 15:56             ` Tejun Heo
2011-10-20 16:04               ` Cyrill Gorcunov
2011-10-20 17:30               ` Pavel Emelyanov
2011-10-15 18:59     ` Cyrill Gorcunov
2011-10-21 11:06     ` Glauber Costa
2011-10-21 11:20       ` Cyrill Gorcunov
2011-10-21 11:21         ` Glauber Costa
2011-10-21 11:35           ` Cyrill Gorcunov
2011-10-22 16:49     ` Dan Merillat

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).