* [PATCH 0/9] OpenVZ kernel based checkpointing/restart
@ 2008-09-03 10:57 Andrey Mirkin
2008-09-03 10:57 ` [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls Andrey Mirkin
` (5 more replies)
0 siblings, 6 replies; 69+ messages in thread
From: Andrey Mirkin @ 2008-09-03 10:57 UTC (permalink / raw)
To: linux-kernel; +Cc: containers, Andrey Mirkin
This patchset introduces kernel based checkpointing/restart as it is
implemented in OpenVZ project. This patchset has limited functionality and
are able to checkpoint/restart only single process. Recently Oren Laaden
sent another kernel based implementation of checkpoint/restart. The main
differences between this patchset and Oren's patchset are:
* In this patchset checkpointing initiated not from the process
(right now we do not have a container, only namespaces), Oren's patchset
performs checkpointing from the process context.
* Restart in this patchset is initiated from process, which restarts a new
process (in new namespaces) with saved state. Oren's patchset uses the same
process from which restart was initiated and restore saved state over it.
* Checkpoint/restart functionality in this patchset is implemented as a kernel
module
As checkpointing is initiated not from the process which state should be saved
we should freeze a process before saving its state. Right now Container Freezer
from Matt Helsley can be used for this.
This patchset introduce only a concept how kernel based checkpointing/restart
can be implemented and are able to checkpoint/restart only a single process
with simple VMAs.
I've tried to split my patchset in small patches to make review more easier.
^ permalink raw reply [flat|nested] 69+ messages in thread* [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls 2008-09-03 10:57 [PATCH 0/9] OpenVZ kernel based checkpointing/restart Andrey Mirkin @ 2008-09-03 10:57 ` Andrey Mirkin 2008-09-03 10:57 ` [PATCH 2/9] Make checkpoint/restart functionality modular Andrey Mirkin 2008-09-03 11:44 ` [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls Cedric Le Goater 2008-09-03 12:28 ` [PATCH 0/9] OpenVZ kernel based checkpointing/restart Cedric Le Goater ` (4 subsequent siblings) 5 siblings, 2 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 10:57 UTC (permalink / raw) To: linux-kernel; +Cc: containers, Andrey Mirkin Right now they just return -ENOSYS. Later they will provide functionality to checkpoint and restart a container. Both syscalls take as arguments a file descriptor and flags. Also sys_checkpoint take as the first argument a PID of container's init (later it will be container ID); sys_restart takes as the first argument a container ID (right now it will not be used). Signed-off-by: Andrey Mirkin <major@openvz.org> --- Makefile | 2 +- arch/x86/kernel/syscall_table_32.S | 2 + cpt/Makefile | 1 + cpt/sys_core.c | 38 ++++++++++++++++++++++++++++++++++++ include/asm-x86/unistd_32.h | 2 + 5 files changed, 44 insertions(+), 1 deletions(-) create mode 100644 cpt/Makefile create mode 100644 cpt/sys_core.c diff --git a/Makefile b/Makefile index ea413fa..1dee5c0 100644 --- a/Makefile +++ b/Makefile @@ -619,7 +619,7 @@ export mod_strip_cmd ifeq ($(KBUILD_EXTMOD),) -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ cpt/ vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \ $(core-y) $(core-m) $(drivers-y) $(drivers-m) \ diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index fd9d4f4..4a0d7fb 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -333,3 +333,5 @@ ENTRY(sys_call_table) .long sys_pipe2 .long sys_inotify_init1 .long sys_hijack + .long sys_checkpoint + .long sys_restart /* 335 */ diff --git a/cpt/Makefile b/cpt/Makefile new file mode 100644 index 0000000..2276fb1 --- /dev/null +++ b/cpt/Makefile @@ -0,0 +1 @@ +obj-y += sys_core.o diff --git a/cpt/sys_core.c b/cpt/sys_core.c new file mode 100644 index 0000000..1a97fb6 --- /dev/null +++ b/cpt/sys_core.c @@ -0,0 +1,38 @@ +/* + * Copyright (C) 2008 Parallels, Inc. + * + * Author: Andrey Mirkin <major@openvz.org> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation, version 2 of the + * License. + * + */ + +#include <linux/sched.h> +#include <linux/fs.h> +#include <linux/file.h> + +/** + * sys_checkpoint - checkpoint a container from outside + * @pid: pid of the container init(1) process + * TODO: should switch to container id later + * @fd: file to which save the checkpoint image + * @flags: checkpoint operation flags + */ +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags) +{ + return -ENOSYS; +} + +/** + * sys_restart - restart a container + * @ctid: container id which should be used to restart a container + * @fd: file from which read the checkpoint image + * @flags: restart operation flags + */ +asmlinkage long sys_restart(int ctid, int fd, unsigned long flags) +{ + return -ENOSYS; +} diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h index 70280da..1a09604 100644 --- a/include/asm-x86/unistd_32.h +++ b/include/asm-x86/unistd_32.h @@ -339,6 +339,8 @@ #define __NR_pipe2 331 #define __NR_inotify_init1 332 #define __NR_hijack 333 +#define __NR_checkpoint 334 +#define __NR_restart 335 #ifdef __KERNEL__ -- 1.5.6 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 2/9] Make checkpoint/restart functionality modular 2008-09-03 10:57 ` [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls Andrey Mirkin @ 2008-09-03 10:57 ` Andrey Mirkin 2008-09-03 10:57 ` [PATCH 3/9] Introduce context structure needed during checkpointing/restart Andrey Mirkin 2008-09-03 14:27 ` [PATCH 2/9] Make checkpoint/restart functionality modular Serge E. Hallyn 2008-09-03 11:44 ` [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls Cedric Le Goater 1 sibling, 2 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 10:57 UTC (permalink / raw) To: linux-kernel; +Cc: containers, Andrey Mirkin A config option CONFIG_CHECKPOINT is introduced. New structure cpt_operations is introduced to store pointers to checkpoint/restart functions from module. Signed-off-by: Andrey Mirkin <major@openvz.org> --- cpt/Kconfig | 7 +++++++ cpt/Makefile | 4 ++++ cpt/cpt.h | 19 +++++++++++++++++++ cpt/sys.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ cpt/sys_core.c | 29 +++++++++++++++++++++++++++-- init/Kconfig | 2 ++ 6 files changed, 107 insertions(+), 2 deletions(-) create mode 100644 cpt/Kconfig create mode 100644 cpt/cpt.h create mode 100644 cpt/sys.c diff --git a/cpt/Kconfig b/cpt/Kconfig new file mode 100644 index 0000000..b9bc72d --- /dev/null +++ b/cpt/Kconfig @@ -0,0 +1,7 @@ +config CHECKPOINT + tristate "Checkpoint & restart for containers" + depends on EXPERIMENTAL + default n + help + This option adds module "cptrst", which allow to save a running + container to a file and restart it later using this image file. diff --git a/cpt/Makefile b/cpt/Makefile index 2276fb1..bfe75d5 100644 --- a/cpt/Makefile +++ b/cpt/Makefile @@ -1 +1,5 @@ obj-y += sys_core.o + +obj-$(CONFIG_CHECKPOINT) += cptrst.o + +cptrst-objs := sys.o diff --git a/cpt/cpt.h b/cpt/cpt.h new file mode 100644 index 0000000..381a9bf --- /dev/null +++ b/cpt/cpt.h @@ -0,0 +1,19 @@ +/* + * Copyright (C) 2008 Parallels, Inc. + * + * Author: Andrey Mirkin <major@openvz.org> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation, version 2 of the + * License. + * + */ + +struct cpt_operations +{ + struct module * owner; + int (*checkpoint)(pid_t pid, int fd, unsigned long flags); + int (*restart)(int ctid, int fd, unsigned long flags); +}; +extern struct cpt_operations cpt_ops; diff --git a/cpt/sys.c b/cpt/sys.c new file mode 100644 index 0000000..4051286 --- /dev/null +++ b/cpt/sys.c @@ -0,0 +1,48 @@ +/* + * Copyright (C) 2008 Parallels, Inc. + * + * Author: Andrey Mirkin <major@openvz.org> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation, version 2 of the + * License. + * + */ + +#include <linux/sched.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/notifier.h> +#include <linux/module.h> + +#include "cpt.h" + +MODULE_LICENSE("GPL"); + +static int checkpoint(pid_t pid, int fd, unsigned long flags) +{ + return -ENOSYS; +} + +static int restart(int ctid, int fd, unsigned long flags) +{ + return -ENOSYS; +} + +static int __init init_cptrst(void) +{ + cpt_ops.owner = THIS_MODULE; + cpt_ops.checkpoint = checkpoint; + cpt_ops.restart = restart; + return 0; +} +module_init(init_cptrst); + +static void __exit exit_cptrst(void) +{ + cpt_ops.checkpoint = NULL; + cpt_ops.restart = NULL; + cpt_ops.owner = NULL; +} +module_exit(exit_cptrst); diff --git a/cpt/sys_core.c b/cpt/sys_core.c index 1a97fb6..5dd1191 100644 --- a/cpt/sys_core.c +++ b/cpt/sys_core.c @@ -13,6 +13,13 @@ #include <linux/sched.h> #include <linux/fs.h> #include <linux/file.h> +#include <linux/notifier.h> +#include <linux/module.h> + +#include "cpt.h" + +struct cpt_operations cpt_ops = { NULL, NULL, NULL }; +EXPORT_SYMBOL(cpt_ops); /** * sys_checkpoint - checkpoint a container from outside @@ -23,7 +30,16 @@ */ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags) { - return -ENOSYS; + int ret; + + ret = -ENOSYS; + + if (try_module_get(cpt_ops.owner)) { + if (cpt_ops.checkpoint) + ret = cpt_ops.checkpoint(pid, fd, flags); + module_put(cpt_ops.owner); + } + return ret; } /** @@ -34,5 +50,14 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags) */ asmlinkage long sys_restart(int ctid, int fd, unsigned long flags) { - return -ENOSYS; + int ret; + + ret = -ENOSYS; + + if (try_module_get(cpt_ops.owner)) { + if (cpt_ops.restart) + ret = cpt_ops.restart(ctid, fd, flags); + module_put(cpt_ops.owner); + } + return ret; } diff --git a/init/Kconfig b/init/Kconfig index 4bd4b0c..d29ed21 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -344,6 +344,8 @@ config CGROUP_FREEZER Provides a way to freeze and unfreeze all tasks in a cgroup +source "cpt/Kconfig" + config FAIR_GROUP_SCHED bool "Group scheduling for SCHED_OTHER" depends on GROUP_SCHED -- 1.5.6 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 3/9] Introduce context structure needed during checkpointing/restart 2008-09-03 10:57 ` [PATCH 2/9] Make checkpoint/restart functionality modular Andrey Mirkin @ 2008-09-03 10:57 ` Andrey Mirkin 2008-09-03 10:57 ` [PATCH 4/9] Introduce container dump function Andrey Mirkin ` (3 more replies) 2008-09-03 14:27 ` [PATCH 2/9] Make checkpoint/restart functionality modular Serge E. Hallyn 1 sibling, 4 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 10:57 UTC (permalink / raw) To: linux-kernel; +Cc: containers, Andrey Mirkin Add functions for context allocation/destroy. Introduce functions to read/write image. Introduce image header and object header. Signed-off-by: Andrey Mirkin <major@openvz.org> --- cpt/cpt.h | 37 ++++++++++++++++ cpt/cpt_image.h | 63 +++++++++++++++++++++++++++ cpt/sys.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 224 insertions(+), 3 deletions(-) create mode 100644 cpt/cpt_image.h diff --git a/cpt/cpt.h b/cpt/cpt.h index 381a9bf..607ac1b 100644 --- a/cpt/cpt.h +++ b/cpt/cpt.h @@ -10,6 +10,8 @@ * */ +#include "cpt_image.h" + struct cpt_operations { struct module * owner; @@ -17,3 +19,38 @@ struct cpt_operations int (*restart)(int ctid, int fd, unsigned long flags); }; extern struct cpt_operations cpt_ops; + +#define CPT_CTX_ERROR -1 +#define CPT_CTX_IDLE 0 +#define CPT_CTX_DUMPING 1 +#define CPT_CTX_UNDUMPING 2 + +typedef struct cpt_context +{ + pid_t pid; /* should be changed to ctid later */ + int ctx_id; /* context id */ + struct list_head ctx_list; + int refcount; + int ctx_state; + struct semaphore main_sem; + + int errno; + + struct file *file; + loff_t current_object; + + struct list_head object_array[CPT_OBJ_MAX]; + + int (*write)(const void *addr, size_t count, struct cpt_context *ctx); + int (*read)(void *addr, size_t count, struct cpt_context *ctx); +} cpt_context_t; + +extern int debug_level; + +#define cpt_printk(lvl, fmt, args...) do { \ + if (lvl <= debug_level) \ + printk(fmt, ##args); \ + } while (0) + +#define eprintk(a...) cpt_printk(1, "CPT ERR: " a) +#define dprintk(a...) cpt_printk(1, "CPT DBG: " a) diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h new file mode 100644 index 0000000..3d26229 --- /dev/null +++ b/cpt/cpt_image.h @@ -0,0 +1,63 @@ +/* + * Copyright (C) 2008 Parallels, Inc. + * + * Author: Andrey Mirkin <major@openvz.org> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation, version 2 of the + * License. + * + */ + +#ifndef __CPT_IMAGE_H_ +#define __CPT_IMAGE_H_ 1 + +enum _cpt_object_type +{ + CPT_OBJ_TASK = 0, + CPT_OBJ_MAX, + /* The objects above are stored in memory while checkpointing */ + + CPT_OBJ_HEAD = 1024, +}; + +enum _cpt_content_type { + CPT_CONTENT_VOID, + CPT_CONTENT_ARRAY, + CPT_CONTENT_DATA, + CPT_CONTENT_NAME, + CPT_CONTENT_REF, + CPT_CONTENT_MAX +}; + +#define CPT_SIGNATURE0 0x79 +#define CPT_SIGNATURE1 0x1c +#define CPT_SIGNATURE2 0x01 +#define CPT_SIGNATURE3 0x63 + +struct cpt_head +{ + __u8 cpt_signature[4]; /* Magic number */ + __u32 cpt_hdrlen; /* Header length */ + __u16 cpt_image_major; /* Format of this file */ + __u16 cpt_image_minor; /* Format of this file */ + __u16 cpt_image_sublevel; /* Format of this file */ + __u16 cpt_image_extra; /* Format of this file */ + __u16 cpt_arch; /* Architecture */ + __u16 cpt_pad1; + __u32 cpt_pad2; +#define CPT_ARCH_I386 0 + __u64 cpt_time; /* Time */ +} __attribute__ ((aligned (8))); + +/* Common object header. */ +struct cpt_object_hdr +{ + __u64 cpt_len; /* Size of current chunk of data */ + __u16 cpt_type; /* Type of object */ + __u32 cpt_hdrlen; /* Size of header */ + __u16 cpt_content; /* Content type: array, reference... */ +} __attribute__ ((aligned (8))); + +#endif /* __CPT_IMAGE_H_ */ diff --git a/cpt/sys.c b/cpt/sys.c index 4051286..8334c4c 100644 --- a/cpt/sys.c +++ b/cpt/sys.c @@ -13,21 +13,142 @@ #include <linux/sched.h> #include <linux/fs.h> #include <linux/file.h> -#include <linux/notifier.h> +#include <linux/uaccess.h> #include <linux/module.h> #include "cpt.h" +#include "cpt_image.h" MODULE_LICENSE("GPL"); +/* Debug level, constant for now */ +int debug_level = 1; + +static int file_write(const void *addr, size_t count, struct cpt_context *ctx) +{ + mm_segment_t oldfs; + ssize_t err = -EBADF; + struct file *file = ctx->file; + + oldfs = get_fs(); set_fs(KERNEL_DS); + if (file) + err = file->f_op->write(file, addr, count, &file->f_pos); + set_fs(oldfs); + if (err != count) + return err >= 0 ? -EIO : err; + return 0; +} + +static int file_read(void *addr, size_t count, struct cpt_context *ctx) +{ + mm_segment_t oldfs; + ssize_t err = -EBADF; + struct file *file = ctx->file; + + oldfs = get_fs(); set_fs(KERNEL_DS); + if (file) + err = file->f_op->read(file, addr, count, &file->f_pos); + set_fs(oldfs); + if (err != count) + return err >= 0 ? -EIO : err; + return 0; +} + +struct cpt_context * context_alloc(void) +{ + struct cpt_context *ctx; + int i; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + init_MUTEX(&ctx->main_sem); + ctx->refcount = 1; + + ctx->current_object = -1; + ctx->write = file_write; + ctx->read = file_read; + for (i = 0; i < CPT_OBJ_MAX; i++) { + INIT_LIST_HEAD(&ctx->object_array[i]); + } + + return ctx; +} + +void context_release(struct cpt_context *ctx) +{ + ctx->ctx_state = CPT_CTX_ERROR; + + if (ctx->file) + fput(ctx->file); + kfree(ctx); +} + +static void context_put(struct cpt_context *ctx) +{ + if (!--ctx->refcount) + context_release(ctx); +} + static int checkpoint(pid_t pid, int fd, unsigned long flags) { - return -ENOSYS; + struct file *file; + struct cpt_context *ctx; + int err; + + err = -EBADF; + file = fget(fd); + if (!file) + goto out; + + err = -ENOMEM; + ctx = context_alloc(); + if (!ctx) + goto out_file; + + ctx->file = file; + ctx->ctx_state = CPT_CTX_DUMPING; + + /* checkpoint */ + err = -ENOSYS; + + context_put(ctx); + +out_file: + fput(file); +out: + return err; } static int restart(int ctid, int fd, unsigned long flags) { - return -ENOSYS; + struct file *file; + struct cpt_context *ctx; + int err; + + err = -EBADF; + file = fget(fd); + if (!file) + goto out; + + err = -ENOMEM; + ctx = context_alloc(); + if (!ctx) + goto out_file; + + ctx->file = file; + ctx->ctx_state = CPT_CTX_UNDUMPING; + + /* restart */ + err = -ENOSYS; + + context_put(ctx); + +out_file: + fput(file); +out: + return err; } static int __init init_cptrst(void) -- 1.5.6 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 4/9] Introduce container dump function 2008-09-03 10:57 ` [PATCH 3/9] Introduce context structure needed during checkpointing/restart Andrey Mirkin @ 2008-09-03 10:57 ` Andrey Mirkin 2008-09-03 10:57 ` [PATCH 5/9] Introduce function to dump process Andrey Mirkin 2008-09-03 14:23 ` [PATCH 4/9] Introduce container dump function Serge E. Hallyn 2008-09-03 12:29 ` [PATCH 3/9] Introduce context structure needed during checkpointing/restart Matthieu Fertré ` (2 subsequent siblings) 3 siblings, 2 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 10:57 UTC (permalink / raw) To: linux-kernel; +Cc: containers, Andrey Mirkin Actually right now we are going to dump only one process. Function for dumping head of image file are added. Signed-off-by: Andrey Mirkin <major@openvz.org> --- cpt/Makefile | 2 +- cpt/checkpoint.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ cpt/cpt.h | 3 ++ cpt/sys.c | 3 +- kernel/fork.c | 2 + 5 files changed, 82 insertions(+), 2 deletions(-) create mode 100644 cpt/checkpoint.c diff --git a/cpt/Makefile b/cpt/Makefile index bfe75d5..173346b 100644 --- a/cpt/Makefile +++ b/cpt/Makefile @@ -2,4 +2,4 @@ obj-y += sys_core.o obj-$(CONFIG_CHECKPOINT) += cptrst.o -cptrst-objs := sys.o +cptrst-objs := sys.o checkpoint.o diff --git a/cpt/checkpoint.c b/cpt/checkpoint.c new file mode 100644 index 0000000..b4d9686 --- /dev/null +++ b/cpt/checkpoint.c @@ -0,0 +1,74 @@ +/* + * Copyright (C) 2008 Parallels, Inc. + * + * Author: Andrey Mirkin <major@openvz.org> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation, version 2 of the + * License. + * + */ + +#include <linux/sched.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/version.h> + +#include "cpt.h" + +static int cpt_write_head(struct cpt_context *ctx) +{ + struct cpt_head hdr; + + memset(&hdr, 0, sizeof(hdr)); + hdr.cpt_signature[0] = CPT_SIGNATURE0; + hdr.cpt_signature[1] = CPT_SIGNATURE1; + hdr.cpt_signature[2] = CPT_SIGNATURE2; + hdr.cpt_signature[3] = CPT_SIGNATURE3; + hdr.cpt_hdrlen = sizeof(hdr); + hdr.cpt_image_major = (LINUX_VERSION_CODE >> 16) & 0xff; + hdr.cpt_image_minor = (LINUX_VERSION_CODE >> 8) & 0xff; + hdr.cpt_image_sublevel = (LINUX_VERSION_CODE) & 0xff; + hdr.cpt_image_extra = 0; +#if defined(CONFIG_X86_32) + hdr.cpt_arch = CPT_ARCH_I386; +#else +#error Arch is not supported +#endif + return ctx->write(&hdr, sizeof(hdr), ctx); +} + +int dump_container(struct cpt_context *ctx) +{ + int err; + struct task_struct *root; + + read_lock(&tasklist_lock); + root = find_task_by_vpid(ctx->pid); + if (root) + get_task_struct(root); + read_unlock(&tasklist_lock); + + err = -ESRCH; + if (!root) { + eprintk("can not find root task\n"); + return err; + } + ctx->nsproxy = root->nsproxy; + if (!ctx->nsproxy) { + eprintk("nsproxy is null\n"); + goto out; + } + + err = cpt_write_head(ctx); + + /* Dump task here */ + if (!err) + err = -ENOSYS; + +out: + ctx->nsproxy = NULL; + put_task_struct(root); + return err; +} diff --git a/cpt/cpt.h b/cpt/cpt.h index 607ac1b..b421a11 100644 --- a/cpt/cpt.h +++ b/cpt/cpt.h @@ -33,6 +33,7 @@ typedef struct cpt_context int refcount; int ctx_state; struct semaphore main_sem; + struct nsproxy *nsproxy; int errno; @@ -54,3 +55,5 @@ extern int debug_level; #define eprintk(a...) cpt_printk(1, "CPT ERR: " a) #define dprintk(a...) cpt_printk(1, "CPT DBG: " a) + +int dump_container(struct cpt_context *ctx); diff --git a/cpt/sys.c b/cpt/sys.c index 8334c4c..6801c22 100644 --- a/cpt/sys.c +++ b/cpt/sys.c @@ -109,9 +109,10 @@ static int checkpoint(pid_t pid, int fd, unsigned long flags) ctx->file = file; ctx->ctx_state = CPT_CTX_DUMPING; + ctx->pid = pid; /* checkpoint */ - err = -ENOSYS; + err = dump_container(ctx); context_put(ctx); diff --git a/kernel/fork.c b/kernel/fork.c index 52b5037..f38b43d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -77,6 +77,7 @@ int max_threads; /* tunable limit on nr_threads */ DEFINE_PER_CPU(unsigned long, process_counts) = 0; __cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */ +EXPORT_SYMBOL(tasklist_lock); int nr_processes(void) { @@ -153,6 +154,7 @@ void __put_task_struct(struct task_struct *tsk) if (!profile_handoff_task(tsk)) free_task(tsk); } +EXPORT_SYMBOL(__put_task_struct); /* * macro override instead of weak attribute alias, to workaround -- 1.5.6 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 5/9] Introduce function to dump process 2008-09-03 10:57 ` [PATCH 4/9] Introduce container dump function Andrey Mirkin @ 2008-09-03 10:57 ` Andrey Mirkin 2008-09-03 10:57 ` [PATCH 6/9] Introduce functions to dump mm Andrey Mirkin 2008-09-03 14:23 ` [PATCH 4/9] Introduce container dump function Serge E. Hallyn 1 sibling, 1 reply; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 10:57 UTC (permalink / raw) To: linux-kernel; +Cc: containers, Andrey Mirkin Functions to dump task struct, fpu state and registers are added. All IDs are saved from the POV of process (container) namespace. Signed-off-by: Andrey Mirkin <major@openvz.org> --- cpt/Makefile | 2 +- cpt/checkpoint.c | 2 +- cpt/cpt.h | 1 + cpt/cpt_image.h | 123 +++++++++++++++++++++++++++ cpt/cpt_process.c | 236 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 362 insertions(+), 2 deletions(-) create mode 100644 cpt/cpt_process.c diff --git a/cpt/Makefile b/cpt/Makefile index 173346b..457cc96 100644 --- a/cpt/Makefile +++ b/cpt/Makefile @@ -2,4 +2,4 @@ obj-y += sys_core.o obj-$(CONFIG_CHECKPOINT) += cptrst.o -cptrst-objs := sys.o checkpoint.o +cptrst-objs := sys.o checkpoint.o cpt_process.o diff --git a/cpt/checkpoint.c b/cpt/checkpoint.c index b4d9686..41c3523 100644 --- a/cpt/checkpoint.c +++ b/cpt/checkpoint.c @@ -65,7 +65,7 @@ int dump_container(struct cpt_context *ctx) /* Dump task here */ if (!err) - err = -ENOSYS; + err = cpt_dump_task(root, ctx); out: ctx->nsproxy = NULL; diff --git a/cpt/cpt.h b/cpt/cpt.h index b421a11..1bb483d 100644 --- a/cpt/cpt.h +++ b/cpt/cpt.h @@ -57,3 +57,4 @@ extern int debug_level; #define dprintk(a...) cpt_printk(1, "CPT DBG: " a) int dump_container(struct cpt_context *ctx); +int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx); diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h index 3d26229..b7b68e1 100644 --- a/cpt/cpt_image.h +++ b/cpt/cpt_image.h @@ -13,6 +13,9 @@ #ifndef __CPT_IMAGE_H_ #define __CPT_IMAGE_H_ 1 +#include <linux/sched.h> +#include <asm/segment.h> + enum _cpt_object_type { CPT_OBJ_TASK = 0, @@ -20,6 +23,8 @@ enum _cpt_object_type /* The objects above are stored in memory while checkpointing */ CPT_OBJ_HEAD = 1024, + CPT_OBJ_X86_REGS, + CPT_OBJ_BITS, }; enum _cpt_content_type { @@ -28,6 +33,8 @@ enum _cpt_content_type { CPT_CONTENT_DATA, CPT_CONTENT_NAME, CPT_CONTENT_REF, + CPT_CONTENT_X86_FPUSTATE, + CPT_CONTENT_X86_FPUSTATE_OLD, CPT_CONTENT_MAX }; @@ -60,4 +67,120 @@ struct cpt_object_hdr __u16 cpt_content; /* Content type: array, reference... */ } __attribute__ ((aligned (8))); +struct cpt_task_image { + __u64 cpt_len; + __u16 cpt_type; + __u32 cpt_hdrlen; + __u16 cpt_content; + + __u64 cpt_state; + __u64 cpt_flags; +#define CPT_PF_EXITING 0 +#define CPT_PF_FORKNOEXEC 1 +#define CPT_PF_SUPERPRIV 2 +#define CPT_PF_DUMPCORE 3 +#define CPT_PF_SIGNALED 4 +#define CPT_PF_USED_MATH 5 + + __u64 cpt_thrflags; + __u64 cpt_thrstatus; + __u32 cpt_pid; + __u32 cpt_tgid; + __u32 cpt_ppid; + __u32 cpt_rppid; + __u32 cpt_pgrp; + __u32 cpt_session; + __u32 cpt_old_pgrp; + __u32 cpt_leader; + __u64 cpt_set_tid; + __u64 cpt_clear_tid; + __u32 cpt_exit_code; + __u32 cpt_exit_signal; + __u32 cpt_pdeath_signal; + __u32 cpt_user; + __u32 cpt_uid; + __u32 cpt_euid; + __u32 cpt_suid; + __u32 cpt_fsuid; + __u32 cpt_gid; + __u32 cpt_egid; + __u32 cpt_sgid; + __u32 cpt_fsgid; + __u8 cpt_comm[TASK_COMM_LEN]; + __u64 cpt_tls[GDT_ENTRY_TLS_ENTRIES]; + __u64 cpt_utime; + __u64 cpt_stime; + __u64 cpt_utimescaled; + __u64 cpt_stimescaled; + __u64 cpt_gtime; + __u64 cpt_prev_utime; + __u64 cpt_prev_stime; + __u64 cpt_start_time; + __u64 cpt_real_start_time; + __u64 cpt_nvcsw; + __u64 cpt_nivcsw; + __u64 cpt_min_flt; + __u64 cpt_maj_flt; +} __attribute__ ((aligned (8))); + +struct cpt_obj_bits +{ + __u64 cpt_len; + __u16 cpt_type; + __u32 cpt_hdrlen; + __u16 cpt_content; + + __u32 cpt_size; + __u32 __cpt_pad1; +} __attribute__ ((aligned (8))); + +#define CPT_SEG_ZERO 0 +#define CPT_SEG_TLS1 1 +#define CPT_SEG_TLS2 2 +#define CPT_SEG_TLS3 3 +#define CPT_SEG_USER32_DS 4 +#define CPT_SEG_USER32_CS 5 +#define CPT_SEG_USER64_DS 6 +#define CPT_SEG_USER64_CS 7 +#define CPT_SEG_LDT 256 + +struct cpt_x86_regs +{ + __u64 cpt_len; + __u16 cpt_type; + __u32 cpt_hdrlen; + __u16 cpt_content; + + __u32 cpt_debugreg[8]; + __u32 cpt_gs; + + __u32 cpt_bx; + __u32 cpt_cx; + __u32 cpt_dx; + __u32 cpt_si; + __u32 cpt_di; + __u32 cpt_bp; + __u32 cpt_ax; + __u32 cpt_ds; + __u32 cpt_es; + __u32 cpt_fs; + __u32 cpt_orig_ax; + __u32 cpt_ip; + __u32 cpt_cs; + __u32 cpt_flags; + __u32 cpt_sp; + __u32 cpt_ss; +} __attribute__ ((aligned (8))); + +static inline __u64 cpt_timespec_export(struct timespec *tv) +{ + return (((u64)tv->tv_sec) << 32) + tv->tv_nsec; +} + +static inline void cpt_timespec_import(struct timespec *tv, __u64 val) +{ + tv->tv_sec = val >> 32; + tv->tv_nsec = (val & 0xFFFFFFFF); +} + #endif /* __CPT_IMAGE_H_ */ diff --git a/cpt/cpt_process.c b/cpt/cpt_process.c new file mode 100644 index 0000000..af4f319 --- /dev/null +++ b/cpt/cpt_process.c @@ -0,0 +1,236 @@ +/* + * Copyright (C) 2008 Parallels, Inc. + * + * Author: Andrey Mirkin <major@openvz.org> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation, version 2 of the + * License. + * + */ + +#include <linux/sched.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/version.h> +#include <linux/nsproxy.h> + +#include "cpt.h" +#include "cpt_image.h" + +static unsigned int encode_task_flags(unsigned int task_flags) +{ + unsigned int flags = 0; + + if (task_flags & PF_EXITING) + flags |= (1 << CPT_PF_EXITING); + if (task_flags & PF_FORKNOEXEC) + flags |= (1 << CPT_PF_FORKNOEXEC); + if (task_flags & PF_SUPERPRIV) + flags |= (1 << CPT_PF_SUPERPRIV); + if (task_flags & PF_DUMPCORE) + flags |= (1 << CPT_PF_DUMPCORE); + if (task_flags & PF_SIGNALED) + flags |= (1 << CPT_PF_SIGNALED); + if (task_flags & PF_USED_MATH) + flags |= (1 << CPT_PF_USED_MATH); + + return flags; + +} + +int cpt_dump_task_struct(struct task_struct *tsk, struct cpt_context *ctx) +{ + struct cpt_task_image *t; + int i; + int err; + + t = kzalloc(sizeof(*t), GFP_KERNEL); + if (!t) + return -ENOMEM; + + t->cpt_len = sizeof(*t); + t->cpt_type = CPT_OBJ_TASK; + t->cpt_hdrlen = sizeof(*t); + t->cpt_content = CPT_CONTENT_ARRAY; + + t->cpt_state = tsk->state; + t->cpt_flags = encode_task_flags(tsk->flags); + t->cpt_exit_code = tsk->exit_code; + t->cpt_exit_signal = tsk->exit_signal; + t->cpt_pdeath_signal = tsk->pdeath_signal; + t->cpt_pid = task_pid_nr_ns(tsk, ctx->nsproxy->pid_ns); + t->cpt_tgid = task_tgid_nr_ns(tsk, ctx->nsproxy->pid_ns); + t->cpt_ppid = tsk->parent ? + task_pid_nr_ns(tsk->parent, ctx->nsproxy->pid_ns) : 0; + t->cpt_rppid = tsk->real_parent ? + task_pid_nr_ns(tsk->real_parent, ctx->nsproxy->pid_ns) : 0; + t->cpt_pgrp = task_pgrp_nr_ns(tsk, ctx->nsproxy->pid_ns); + t->cpt_session = task_session_nr_ns(tsk, ctx->nsproxy->pid_ns); + t->cpt_old_pgrp = 0; + if (tsk->signal->tty_old_pgrp) + t->cpt_old_pgrp = pid_vnr(tsk->signal->tty_old_pgrp); + t->cpt_leader = tsk->group_leader ? task_pid_vnr(tsk->group_leader) : 0; + t->cpt_utime = tsk->utime; + t->cpt_stime = tsk->stime; + t->cpt_utimescaled = tsk->utimescaled; + t->cpt_stimescaled = tsk->stimescaled; + t->cpt_gtime = tsk->gtime; + t->cpt_prev_utime = tsk->prev_utime; + t->cpt_prev_stime = tsk->prev_stime; + t->cpt_nvcsw = tsk->nvcsw; + t->cpt_nivcsw = tsk->nivcsw; + t->cpt_start_time = cpt_timespec_export(&tsk->start_time); + t->cpt_real_start_time = cpt_timespec_export(&tsk->real_start_time); + t->cpt_min_flt = tsk->min_flt; + t->cpt_maj_flt = tsk->maj_flt; + memcpy(t->cpt_comm, tsk->comm, TASK_COMM_LEN); + for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) { + t->cpt_tls[i] = (((u64)tsk->thread.tls_array[i].b) << 32) + + tsk->thread.tls_array[i].a; + } + /* TODO: encode thread flags and status like task flags */ + t->cpt_thrflags = task_thread_info(tsk)->flags & ~(1<<TIF_FREEZE); + t->cpt_thrstatus = task_thread_info(tsk)->status; + t->cpt_user = tsk->user->uid; + t->cpt_uid = tsk->uid; + t->cpt_euid = tsk->euid; + t->cpt_suid = tsk->suid; + t->cpt_fsuid = tsk->fsuid; + t->cpt_gid = tsk->gid; + t->cpt_egid = tsk->egid; + t->cpt_sgid = tsk->sgid; + t->cpt_fsgid = tsk->fsgid; + + err = ctx->write(t, sizeof(*t), ctx); + + kfree(t); + return err; +} + +static int cpt_dump_fpustate(struct task_struct *tsk, struct cpt_context *ctx) +{ + struct cpt_obj_bits hdr; + int err; + int content; + unsigned long size; + + content = CPT_CONTENT_X86_FPUSTATE; + size = sizeof(struct i387_fxsave_struct); +#ifndef CONFIG_X86_64 + if (!cpu_has_fxsr) { + size = sizeof(struct i387_fsave_struct); + content = CPT_CONTENT_X86_FPUSTATE_OLD; + } +#endif + + hdr.cpt_len = sizeof(hdr) + size; + hdr.cpt_type = CPT_OBJ_BITS; + hdr.cpt_hdrlen = sizeof(hdr); + hdr.cpt_content = content; + hdr.cpt_size = size; + err = ctx->write(&hdr, sizeof(hdr), ctx); + if (!err) + ctx->write(tsk->thread.xstate, size, ctx); + return err; +} + +static u32 encode_segment(u32 segreg) +{ + segreg &= 0xFFFF; + + if (segreg == 0) + return CPT_SEG_ZERO; + if ((segreg & 3) != 3) { + eprintk("Invalid RPL of a segment reg %x\n", segreg); + return CPT_SEG_ZERO; + } + + /* LDT descriptor, it is just an index to LDT array */ + if (segreg & 4) + return CPT_SEG_LDT + (segreg >> 3); + + /* TLS descriptor. */ + if ((segreg >> 3) >= GDT_ENTRY_TLS_MIN && + (segreg >> 3) <= GDT_ENTRY_TLS_MAX) + return CPT_SEG_TLS1 + ((segreg>>3) - GDT_ENTRY_TLS_MIN); + + /* One of standard desriptors */ +#ifdef CONFIG_X86_64 + if (segreg == __USER32_DS) + return CPT_SEG_USER32_DS; + if (segreg == __USER32_CS) + return CPT_SEG_USER32_CS; + if (segreg == __USER_DS) + return CPT_SEG_USER64_DS; + if (segreg == __USER_CS) + return CPT_SEG_USER64_CS; +#else + if (segreg == __USER_DS) + return CPT_SEG_USER32_DS; + if (segreg == __USER_CS) + return CPT_SEG_USER32_CS; +#endif + eprintk("Invalid segment reg %x\n", segreg); + return CPT_SEG_ZERO; +} + +static int cpt_dump_registers(struct task_struct *tsk, struct cpt_context *ctx) +{ + struct cpt_x86_regs ri; + struct pt_regs *pt_regs; + + ri.cpt_len = sizeof(ri); + ri.cpt_type = CPT_OBJ_X86_REGS; + ri.cpt_hdrlen = sizeof(ri); + ri.cpt_content = CPT_CONTENT_VOID; + + ri.cpt_debugreg[0] = tsk->thread.debugreg0; + ri.cpt_debugreg[1] = tsk->thread.debugreg1; + ri.cpt_debugreg[2] = tsk->thread.debugreg2; + ri.cpt_debugreg[3] = tsk->thread.debugreg3; + ri.cpt_debugreg[4] = 0; + ri.cpt_debugreg[5] = 0; + ri.cpt_debugreg[6] = tsk->thread.debugreg6; + ri.cpt_debugreg[7] = tsk->thread.debugreg7; + + pt_regs = task_pt_regs(tsk); + + ri.cpt_fs = encode_segment(pt_regs->fs); + ri.cpt_gs = encode_segment(tsk->thread.gs); + + ri.cpt_bx = pt_regs->bx; + ri.cpt_cx = pt_regs->cx; + ri.cpt_dx = pt_regs->dx; + ri.cpt_si = pt_regs->si; + ri.cpt_di = pt_regs->di; + ri.cpt_bp = pt_regs->bp; + ri.cpt_ax = pt_regs->ax; + ri.cpt_ds = encode_segment(pt_regs->ds); + ri.cpt_es = encode_segment(pt_regs->es); + ri.cpt_orig_ax = pt_regs->orig_ax; + ri.cpt_ip = pt_regs->ip; + ri.cpt_cs = encode_segment(pt_regs->cs); + ri.cpt_flags = pt_regs->flags; + ri.cpt_sp = pt_regs->sp; + ri.cpt_ss = encode_segment(pt_regs->ss); + + return ctx->write(&ri, sizeof(ri), ctx); +} + +int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx) +{ + int err; + + err = cpt_dump_task_struct(tsk, ctx); + + /* Dump task mm */ + + if (!err) + cpt_dump_fpustate(tsk, ctx); + if (!err) + cpt_dump_registers(tsk, ctx); + + return err; +} -- 1.5.6 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 6/9] Introduce functions to dump mm 2008-09-03 10:57 ` [PATCH 5/9] Introduce function to dump process Andrey Mirkin @ 2008-09-03 10:57 ` Andrey Mirkin 2008-09-03 10:57 ` [PATCH 7/9] Introduce function for restarting a container Andrey Mirkin 2008-09-03 14:17 ` [PATCH 6/9] Introduce functions to dump mm Louis Rilling 0 siblings, 2 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 10:57 UTC (permalink / raw) To: linux-kernel; +Cc: containers, Andrey Mirkin Functions to dump mm struct, VMAs and mm context are added. Signed-off-by: Andrey Mirkin <major@openvz.org> --- arch/x86/mm/hugetlbpage.c | 2 + cpt/Makefile | 2 +- cpt/cpt.h | 1 + cpt/cpt_image.h | 61 +++++++ cpt/cpt_mm.c | 431 +++++++++++++++++++++++++++++++++++++++++++++ cpt/cpt_process.c | 8 +- mm/memory.c | 1 + 7 files changed, 497 insertions(+), 5 deletions(-) create mode 100644 cpt/cpt_mm.c diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c index 8f307d9..63028e7 100644 --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -12,6 +12,7 @@ #include <linux/slab.h> #include <linux/err.h> #include <linux/sysctl.h> +#include <linux/module.h> #include <asm/mman.h> #include <asm/tlb.h> #include <asm/tlbflush.h> @@ -221,6 +222,7 @@ int pmd_huge(pmd_t pmd) { return !!(pmd_val(pmd) & _PAGE_PSE); } +EXPORT_SYMBOL(pmd_huge); int pud_huge(pud_t pud) { diff --git a/cpt/Makefile b/cpt/Makefile index 457cc96..bbb0e37 100644 --- a/cpt/Makefile +++ b/cpt/Makefile @@ -2,4 +2,4 @@ obj-y += sys_core.o obj-$(CONFIG_CHECKPOINT) += cptrst.o -cptrst-objs := sys.o checkpoint.o cpt_process.o +cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o diff --git a/cpt/cpt.h b/cpt/cpt.h index 1bb483d..73ae296 100644 --- a/cpt/cpt.h +++ b/cpt/cpt.h @@ -58,3 +58,4 @@ extern int debug_level; int dump_container(struct cpt_context *ctx); int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx); +int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx); diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h index b7b68e1..ae019e7 100644 --- a/cpt/cpt_image.h +++ b/cpt/cpt_image.h @@ -16,13 +16,19 @@ #include <linux/sched.h> #include <asm/segment.h> +#define CPT_NULL (~0ULL) + enum _cpt_object_type { CPT_OBJ_TASK = 0, + CPT_OBJ_MM, CPT_OBJ_MAX, /* The objects above are stored in memory while checkpointing */ CPT_OBJ_HEAD = 1024, + CPT_OBJ_VMA, + CPT_OBJ_PAGES, + CPT_OBJ_NAME, CPT_OBJ_X86_REGS, CPT_OBJ_BITS, }; @@ -35,6 +41,7 @@ enum _cpt_content_type { CPT_CONTENT_REF, CPT_CONTENT_X86_FPUSTATE, CPT_CONTENT_X86_FPUSTATE_OLD, + CPT_CONTENT_MM_CONTEXT, CPT_CONTENT_MAX }; @@ -123,6 +130,60 @@ struct cpt_task_image { __u64 cpt_maj_flt; } __attribute__ ((aligned (8))); +struct cpt_mm_image { + __u64 cpt_len; + __u16 cpt_type; + __u32 cpt_hdrlen; + __u16 cpt_content; + + __u64 cpt_start_code; + __u64 cpt_end_code; + __u64 cpt_start_data; + __u64 cpt_end_data; + __u64 cpt_start_brk; + __u64 cpt_brk; + __u64 cpt_start_stack; + __u64 cpt_start_arg; + __u64 cpt_end_arg; + __u64 cpt_start_env; + __u64 cpt_end_env; + __u64 cpt_def_flags; + __u64 cpt_flags; + __u64 cpt_map_count; +} __attribute__ ((aligned (8))); + +struct cpt_vma_image +{ + __u64 cpt_len; + __u16 cpt_type; + __u32 cpt_hdrlen; + __u16 cpt_content; + + __u64 cpt_file; + __u32 cpt_vma_type; +#define CPT_VMA_TYPE_0 0 +#define CPT_VMA_FILE 1 + __u32 cpt_pad; + + __u64 cpt_start; + __u64 cpt_end; + __u64 cpt_flags; + __u64 cpt_pgprot; + __u64 cpt_pgoff; + __u64 cpt_page_num; +} __attribute__ ((aligned (8))); + +struct cpt_page_block +{ + __u64 cpt_len; + __u16 cpt_type; + __u32 cpt_hdrlen; + __u16 cpt_content; + + __u64 cpt_start; + __u64 cpt_end; +} __attribute__ ((aligned (8))); + struct cpt_obj_bits { __u64 cpt_len; diff --git a/cpt/cpt_mm.c b/cpt/cpt_mm.c new file mode 100644 index 0000000..6e025cc --- /dev/null +++ b/cpt/cpt_mm.c @@ -0,0 +1,431 @@ +/* + * Copyright (C) 2008 Parallels, Inc. + * + * Authors: Andrey Mirkin <major@openvz.org> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation, version 2 of the + * License. + * + */ + +#include <linux/sched.h> +#include <linux/slab.h> +#include <linux/file.h> +#include <linux/mm.h> +#include <linux/errno.h> +#include <linux/major.h> +#include <linux/mman.h> +#include <linux/mnt_namespace.h> +#include <linux/mount.h> +#include <linux/namei.h> +#include <linux/pagemap.h> +#include <linux/hugetlb.h> +#include <asm/ldt.h> + +#include "cpt.h" +#include "cpt_image.h" + +struct page_area +{ + int type; + unsigned long start; + unsigned long end; + pgoff_t pgoff; + loff_t mm; + __u64 list[16]; +}; + +struct page_desc +{ + int type; + pgoff_t index; + loff_t mm; + int shared; +}; + +enum { + PD_ABSENT, + PD_COPY, + PD_FUNKEY, +}; + +/* 0: page can be obtained from backstore, or still not mapped anonymous page, + or something else, which does not requre copy. + 1: page requires copy + 2: page requres copy but its content is zero. Quite useless. + 3: wp page is shared after fork(). It is to be COWed when modified. + 4: page is something unsupported... We copy it right now. + */ + +static void page_get_desc(struct vm_area_struct *vma, unsigned long addr, + struct page_desc *pdesc, cpt_context_t * ctx) +{ + struct mm_struct *mm = vma->vm_mm; + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *ptep, pte; + spinlock_t *ptl; + struct page *pg = NULL; + pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff; + + pdesc->index = linear_index; + pdesc->shared = 0; + pdesc->mm = CPT_NULL; + + if (vma->vm_flags & VM_IO) { + pdesc->type = PD_ABSENT; + return; + } + + pgd = pgd_offset(mm, addr); + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd))) + goto out_absent; + pud = pud_offset(pgd, addr); + if (pud_none(*pud) || unlikely(pud_bad(*pud))) + goto out_absent; + pmd = pmd_offset(pud, addr); + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) + goto out_absent; +#ifdef CONFIG_X86 + if (pmd_huge(*pmd)) { + eprintk("page_huge\n"); + goto out_unsupported; + } +#endif + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); + pte = *ptep; + pte_unmap(ptep); + + if (pte_none(pte)) + goto out_absent_unlock; + + if ((pg = vm_normal_page(vma, addr, pte)) == NULL) { + pdesc->type = PD_COPY; + goto out_unlock; + } + + get_page(pg); + spin_unlock(ptl); + + if (pg->mapping && !PageAnon(pg)) { + if (vma->vm_file == NULL) { + eprintk("pg->mapping!=NULL for fileless vma: %08lx\n", addr); + goto out_unsupported; + } + if (vma->vm_file->f_mapping != pg->mapping) { + eprintk("pg->mapping!=f_mapping: %08lx %p %p\n", + addr, vma->vm_file->f_mapping, pg->mapping); + goto out_unsupported; + } + pdesc->index = (pg->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT)); + /* Page is in backstore. For us it is like + * it is not present. + */ + goto out_absent; + } + + if (PageReserved(pg)) { + /* Special case: ZERO_PAGE is used, when an + * anonymous page is accessed but not written. */ + if (pg == ZERO_PAGE(addr)) { + if (pte_write(pte)) { + eprintk("not funny already, writable ZERO_PAGE\n"); + goto out_unsupported; + } + goto out_absent; + } + eprintk("reserved page %lu at %08lx\n", pg->index, addr); + goto out_unsupported; + } + + if (!pg->mapping) { + eprintk("page without mapping at %08lx\n", addr); + goto out_unsupported; + } + + pdesc->type = PD_COPY; + +out_put: + if (pg) + put_page(pg); + return; + +out_unlock: + spin_unlock(ptl); + goto out_put; + +out_absent_unlock: + spin_unlock(ptl); + +out_absent: + pdesc->type = PD_ABSENT; + goto out_put; + +out_unsupported: + pdesc->type = PD_FUNKEY; + goto out_put; +} + +static int count_vma_pages(struct vm_area_struct *vma, struct cpt_context *ctx) +{ + unsigned long addr; + int page_num = 0; + + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { + struct page_desc pd; + + page_get_desc(vma, addr, &pd, ctx); + + if (pd.type != PD_COPY) { + return -EINVAL; + } else { + page_num += 1; + } + + } + return page_num; +} + +/* ATTN: We give "current" to get_user_pages(). This is wrong, but get_user_pages() + * does not really need this thing. It just stores some page fault stats there. + * + * BUG: some archs (f.e. sparc64, but not Intel*) require flush cache pages + * before accessing vma. + */ +static int dump_pages(struct vm_area_struct *vma, unsigned long start, + unsigned long end, struct cpt_context *ctx) +{ +#define MAX_PAGE_BATCH 16 + struct page *pg[MAX_PAGE_BATCH]; + int npages = (end - start)/PAGE_SIZE; + int count = 0; + + while (count < npages) { + int copy = npages - count; + int n; + + if (copy > MAX_PAGE_BATCH) + copy = MAX_PAGE_BATCH; + n = get_user_pages(current, vma->vm_mm, start, copy, + 0, 1, pg, NULL); + if (n == copy) { + int i; + for (i=0; i<n; i++) { + char *maddr = kmap(pg[i]); + ctx->write(maddr, PAGE_SIZE, ctx); + kunmap(pg[i]); + } + } else { + eprintk("get_user_pages fault"); + for ( ; n > 0; n--) + page_cache_release(pg[n-1]); + return -EFAULT; + } + start += n*PAGE_SIZE; + count += n; + for ( ; n > 0; n--) + page_cache_release(pg[n-1]); + } + return 0; +} + +static int dump_page_block(struct vm_area_struct *vma, + struct cpt_page_block *pgb, + struct cpt_context *ctx) +{ + int err; + pgb->cpt_len = sizeof(*pgb) + pgb->cpt_end - pgb->cpt_start; + pgb->cpt_type = CPT_OBJ_PAGES; + pgb->cpt_hdrlen = sizeof(*pgb); + pgb->cpt_content = CPT_CONTENT_DATA; + + err = ctx->write(pgb, sizeof(*pgb), ctx); + if (!err) + err = dump_pages(vma, pgb->cpt_start, pgb->cpt_end, ctx); + + return err; +} + +static int cpt_dump_dentry(struct path *p, cpt_context_t *ctx) +{ + int len; + char *path; + char *buf; + struct cpt_object_hdr o; + + buf = (char *)__get_free_page(GFP_KERNEL); + if (!buf) + return -ENOMEM; + + path = d_path(p, buf, PAGE_SIZE); + + if (IS_ERR(path)) { + free_page((unsigned long)buf); + return PTR_ERR(path); + } + + len = buf + PAGE_SIZE - 1 - path; + o.cpt_len = sizeof(o) + len + 1; + o.cpt_type = CPT_OBJ_NAME; + o.cpt_hdrlen = sizeof(o); + o.cpt_content = CPT_CONTENT_NAME; + path[len] = 0; + + ctx->write(&o, sizeof(o), ctx); + ctx->write(path, len + 1, ctx); + free_page((unsigned long)buf); + + return 0; +} + +static int dump_one_vma(struct mm_struct *mm, + struct vm_area_struct *vma, struct cpt_context *ctx) +{ + struct cpt_vma_image *v; + unsigned long addr; + int page_num; + int err; + + v = kzalloc(sizeof(*v), GFP_KERNEL); + if (!v) + return -ENOMEM; + + v->cpt_len = sizeof(*v); + v->cpt_type = CPT_OBJ_VMA; + v->cpt_hdrlen = sizeof(*v); + v->cpt_content = CPT_CONTENT_ARRAY; + + v->cpt_start = vma->vm_start; + v->cpt_end = vma->vm_end; + v->cpt_flags = vma->vm_flags; + if (vma->vm_flags & VM_HUGETLB) { + eprintk("huge TLB VMAs are still not supported\n"); + kfree(v); + return -EINVAL; + } + v->cpt_pgprot = vma->vm_page_prot.pgprot; + v->cpt_pgoff = vma->vm_pgoff; + v->cpt_file = CPT_NULL; + v->cpt_vma_type = CPT_VMA_TYPE_0; + + page_num = count_vma_pages(vma, ctx); + if (page_num < 0) { + kfree(v); + return -EINVAL; + } + v->cpt_page_num = page_num; + + if (vma->vm_file) { + v->cpt_file = 0; + v->cpt_vma_type = CPT_VMA_FILE; + } + + ctx->write(v, sizeof(*v), ctx); + kfree(v); + + if (vma->vm_file) { + err = cpt_dump_dentry(&vma->vm_file->f_path, ctx); + if (err < 0) + return err; + } + + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { + struct page_desc pd; + struct cpt_page_block pgb; + + page_get_desc(vma, addr, &pd, ctx); + + if (pd.type == PD_FUNKEY || pd.type == PD_ABSENT) { + eprintk("dump_one_vma: funkey page\n"); + return -EINVAL; + } + + pgb.cpt_start = addr; + pgb.cpt_end = addr + PAGE_SIZE; + dump_page_block(vma, &pgb, ctx); + } + + return 0; +} + +static int cpt_dump_mm_context(struct mm_struct *mm, struct cpt_context *ctx) +{ +#ifdef CONFIG_X86 + if (mm->context.size) { + struct cpt_obj_bits b; + int size; + + mutex_lock(&mm->context.lock); + + b.cpt_type = CPT_OBJ_BITS; + b.cpt_len = sizeof(b); + b.cpt_content = CPT_CONTENT_MM_CONTEXT; + b.cpt_size = mm->context.size * LDT_ENTRY_SIZE; + + ctx->write(&b, sizeof(b), ctx); + + size = mm->context.size * LDT_ENTRY_SIZE; + + ctx->write(mm->context.ldt, size, ctx); + + mutex_unlock(&mm->context.lock); + } +#endif + return 0; +} + +int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx) +{ + struct mm_struct *mm = tsk->mm; + struct cpt_mm_image *v; + struct vm_area_struct *vma; + int err; + + v = kzalloc(sizeof(*v), GFP_KERNEL); + if (!v) + return -ENOMEM; + + v->cpt_len = sizeof(*v); + v->cpt_type = CPT_OBJ_MM; + v->cpt_hdrlen = sizeof(*v); + v->cpt_content = CPT_CONTENT_ARRAY; + + v->cpt_start_code = mm->start_code; + v->cpt_end_code = mm->end_code; + v->cpt_start_data = mm->start_data; + v->cpt_end_data = mm->end_data; + v->cpt_start_brk = mm->start_brk; + v->cpt_brk = mm->brk; + v->cpt_start_stack = mm->start_stack; + v->cpt_start_arg = mm->arg_start; + v->cpt_end_arg = mm->arg_end; + v->cpt_start_env = mm->env_start; + v->cpt_end_env = mm->env_end; + v->cpt_def_flags = mm->def_flags; + v->cpt_flags = mm->flags; + v->cpt_map_count = mm->map_count; + + err = ctx->write(v, sizeof(*v), ctx); + kfree(v); + + if (err) { + eprintk("error during writing mm\n"); + return err; + } + + for (vma = mm->mmap; vma; vma = vma->vm_next) { + int err; + + if ((err = dump_one_vma(mm, vma, ctx)) != 0) + return err; + } + + if (!err) + err = cpt_dump_mm_context(mm, ctx); + + return err; +} + diff --git a/cpt/cpt_process.c b/cpt/cpt_process.c index af4f319..7c4b981 100644 --- a/cpt/cpt_process.c +++ b/cpt/cpt_process.c @@ -225,12 +225,12 @@ int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx) err = cpt_dump_task_struct(tsk, ctx); - /* Dump task mm */ - if (!err) - cpt_dump_fpustate(tsk, ctx); + err = cpt_dump_mm(tsk, ctx); + if (!err) + err = cpt_dump_fpustate(tsk, ctx); if (!err) - cpt_dump_registers(tsk, ctx); + err = cpt_dump_registers(tsk, ctx); return err; } diff --git a/mm/memory.c b/mm/memory.c index 1002f47..479a294 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -481,6 +481,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, out: return pfn_to_page(pfn); } +EXPORT_SYMBOL(vm_normal_page); /* * copy one vm_area from one task to the other. Assumes the page tables -- 1.5.6 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 7/9] Introduce function for restarting a container 2008-09-03 10:57 ` [PATCH 6/9] Introduce functions to dump mm Andrey Mirkin @ 2008-09-03 10:57 ` Andrey Mirkin 2008-09-03 10:57 ` [PATCH 8/9] Introduce functions to restart a process Andrey Mirkin 2008-09-03 14:17 ` [PATCH 6/9] Introduce functions to dump mm Louis Rilling 1 sibling, 1 reply; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 10:57 UTC (permalink / raw) To: linux-kernel; +Cc: containers, Andrey Mirkin Actually, right now this function will restart only one process. Function to read head of dump file is introduced. Signed-off-by: Andrey Mirkin <major@openvz.org> --- cpt/Makefile | 2 +- cpt/cpt.h | 1 + cpt/restart.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ cpt/sys.c | 2 +- 4 files changed, 90 insertions(+), 2 deletions(-) create mode 100644 cpt/restart.c diff --git a/cpt/Makefile b/cpt/Makefile index bbb0e37..47c7852 100644 --- a/cpt/Makefile +++ b/cpt/Makefile @@ -2,4 +2,4 @@ obj-y += sys_core.o obj-$(CONFIG_CHECKPOINT) += cptrst.o -cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o +cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o diff --git a/cpt/cpt.h b/cpt/cpt.h index 73ae296..6471246 100644 --- a/cpt/cpt.h +++ b/cpt/cpt.h @@ -59,3 +59,4 @@ extern int debug_level; int dump_container(struct cpt_context *ctx); int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx); int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx); +int restart_container(struct cpt_context *ctx); diff --git a/cpt/restart.c b/cpt/restart.c new file mode 100644 index 0000000..5770985 --- /dev/null +++ b/cpt/restart.c @@ -0,0 +1,87 @@ +/* + * Copyright (C) 2008 Parallels, Inc. + * + * Author: Andrey Mirkin <major@openvz.org> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation, version 2 of the + * License. + * + */ + +#include <linux/sched.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/version.h> + +#include "cpt.h" +#include "cpt_image.h" + +int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx) +{ + int err; + struct cpt_object_hdr *hdr = tmp; + err = ctx->read(hdr, sizeof(struct cpt_object_hdr), ctx); + if (err) + return err; + if (type > 0 && type != hdr->cpt_type) + return -EINVAL; + if (hdr->cpt_hdrlen < sizeof(struct cpt_object_hdr)) + return -EINVAL; + if (size < sizeof(struct cpt_object_hdr)) + return -EINVAL; + if (hdr->cpt_len < hdr->cpt_hdrlen) + return -EINVAL; + if (size > hdr->cpt_hdrlen) + size = hdr->cpt_hdrlen; + if (size > sizeof(*hdr)) + err = ctx->read(hdr + 1, size - sizeof(*hdr), ctx); + return err; +} + +static int rst_read_head(struct cpt_context *ctx) +{ + struct cpt_head hdr; + int err; + + err = -EBADF; + if (!ctx->file) + return err; + + err = ctx->read(&hdr, sizeof(hdr), ctx); + if (err < 0) + return err; + + if (hdr.cpt_signature[0] != CPT_SIGNATURE0 || + hdr.cpt_signature[1] != CPT_SIGNATURE1 || + hdr.cpt_signature[2] != CPT_SIGNATURE2 || + hdr.cpt_signature[3] != CPT_SIGNATURE3) { + return -EINVAL; + } + if (KERNEL_VERSION(hdr.cpt_image_major, hdr.cpt_image_minor, + hdr.cpt_image_sublevel) != LINUX_VERSION_CODE) + return -EINVAL; + +#if defined(CONFIG_X86_32) + if (hdr.cpt_arch != CPT_ARCH_I386) + return -ENOSYS; +#else +#error Arch is not supported +#endif + + return 0; +} + +int restart_container(struct cpt_context *ctx) +{ + int err; + + err = rst_read_head(ctx); + + /* Restart process */ + if (!err) + err = -ENOSYS; + + return err; +} diff --git a/cpt/sys.c b/cpt/sys.c index 6801c22..5b453f2 100644 --- a/cpt/sys.c +++ b/cpt/sys.c @@ -142,7 +142,7 @@ static int restart(int ctid, int fd, unsigned long flags) ctx->ctx_state = CPT_CTX_UNDUMPING; /* restart */ - err = -ENOSYS; + err = restart_container(ctx); context_put(ctx); -- 1.5.6 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 8/9] Introduce functions to restart a process 2008-09-03 10:57 ` [PATCH 7/9] Introduce function for restarting a container Andrey Mirkin @ 2008-09-03 10:57 ` Andrey Mirkin 2008-09-03 10:57 ` [PATCH 9/9] Introduce functions to restore mm Andrey Mirkin 2008-09-03 14:32 ` [PATCH 8/9] Introduce functions to restart a process Louis Rilling 0 siblings, 2 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 10:57 UTC (permalink / raw) To: linux-kernel; +Cc: containers, Andrey Mirkin Functions to restart process, restore its state, fpu and registers are added. Signed-off-by: Andrey Mirkin <major@openvz.org> --- arch/x86/kernel/entry_32.S | 21 +++ arch/x86/kernel/process_32.c | 3 + cpt/Makefile | 3 +- cpt/cpt.h | 2 + cpt/restart.c | 2 +- cpt/rst_process.c | 277 ++++++++++++++++++++++++++++++++++++++++++ kernel/sched.c | 1 + 7 files changed, 307 insertions(+), 2 deletions(-) create mode 100644 cpt/rst_process.c diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index 109792b..a4848a3 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -225,6 +225,7 @@ ENTRY(ret_from_fork) GET_THREAD_INFO(%ebp) popl %eax CFI_ADJUST_CFA_OFFSET -4 +ret_from_fork_tail: pushl $0x0202 # Reset kernel eflags CFI_ADJUST_CFA_OFFSET 4 popfl @@ -233,6 +234,26 @@ ENTRY(ret_from_fork) CFI_ENDPROC END(ret_from_fork) +ENTRY(i386_ret_from_resume) + CFI_STARTPROC + pushl %eax + CFI_ADJUST_CFA_OFFSET 4 + call schedule_tail + GET_THREAD_INFO(%ebp) + popl %eax + CFI_ADJUST_CFA_OFFSET -4 + movl (%esp), %eax + testl %eax, %eax + jz 1f + pushl %esp + call *%eax + addl $4, %esp +1: + addl $256, %esp + jmp ret_from_fork_tail + CFI_ENDPROC +END(i386_ret_from_resume) + /* * Return to user mode is not as complex as all this looks, * but we want the default path for a system call return to diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 4711eed..1bdec02 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -58,6 +58,9 @@ #include <asm/kdebug.h> asmlinkage void ret_from_fork(void) __asm__("ret_from_fork"); +EXPORT_SYMBOL(ret_from_fork); +asmlinkage void i386_ret_from_resume(void) __asm__("i386_ret_from_resume"); +EXPORT_SYMBOL(i386_ret_from_resume); DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task; EXPORT_PER_CPU_SYMBOL(current_task); diff --git a/cpt/Makefile b/cpt/Makefile index 47c7852..689a0eb 100644 --- a/cpt/Makefile +++ b/cpt/Makefile @@ -2,4 +2,5 @@ obj-y += sys_core.o obj-$(CONFIG_CHECKPOINT) += cptrst.o -cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o +cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o \ + rst_process.o diff --git a/cpt/cpt.h b/cpt/cpt.h index 6471246..d59255f 100644 --- a/cpt/cpt.h +++ b/cpt/cpt.h @@ -60,3 +60,5 @@ int dump_container(struct cpt_context *ctx); int cpt_dump_task(struct task_struct *tsk, struct cpt_context *ctx); int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx); int restart_container(struct cpt_context *ctx); +int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx); +int rst_restart_process(struct cpt_context *ctx); diff --git a/cpt/restart.c b/cpt/restart.c index 5770985..e856955 100644 --- a/cpt/restart.c +++ b/cpt/restart.c @@ -81,7 +81,7 @@ int restart_container(struct cpt_context *ctx) /* Restart process */ if (!err) - err = -ENOSYS; + err = rst_restart_process(ctx); return err; } diff --git a/cpt/rst_process.c b/cpt/rst_process.c new file mode 100644 index 0000000..6d47f3c --- /dev/null +++ b/cpt/rst_process.c @@ -0,0 +1,277 @@ +/* + * Copyright (C) 2008 Parallels, Inc. + * + * Author: Andrey Mirkin <major@openvz.org> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation, version 2 of the + * License. + * + */ + +#include <linux/sched.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/version.h> +#include <linux/module.h> + +#include "cpt.h" +#include "cpt_image.h" + +#define HOOK_RESERVE 256 + +struct thr_context { + struct completion complete; + int error; + struct cpt_context *ctx; + struct task_struct *tsk; +}; + +int local_kernel_thread(int (*fn)(void *), void * arg, unsigned long flags, pid_t pid) +{ + pid_t ret; + + if (current->fs == NULL) { + /* do_fork_pid() hates processes without fs, oopses. */ + eprintk("local_kernel_thread: current->fs==NULL\n"); + return -EINVAL; + } + if (!try_module_get(THIS_MODULE)) + return -EBUSY; + ret = kernel_thread(fn, arg, flags); + if (ret < 0) + module_put(THIS_MODULE); + return ret; +} + +static unsigned int decode_task_flags(unsigned int task_flags) +{ + unsigned int flags = 0; + + if (task_flags & (1 << CPT_PF_EXITING)) + flags |= PF_EXITING; + if (task_flags & (1 << CPT_PF_FORKNOEXEC)) + flags |= PF_FORKNOEXEC; + if (task_flags & (1 << CPT_PF_SUPERPRIV)) + flags |= PF_SUPERPRIV; + if (task_flags & (1 << CPT_PF_DUMPCORE)) + flags |= PF_DUMPCORE; + if (task_flags & (1 << CPT_PF_SIGNALED)) + flags |= PF_SIGNALED; + + return flags; + +} + +int rst_restore_task_struct(struct task_struct *tsk, struct cpt_task_image *ti, + struct cpt_context *ctx) +{ + int i; + + /* Restore only saved flags, comm and tls for now */ + tsk->flags = decode_task_flags(ti->cpt_flags); + clear_tsk_thread_flag(tsk, TIF_FREEZE); + memcpy(tsk->comm, ti->cpt_comm, TASK_COMM_LEN); + for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) { + tsk->thread.tls_array[i].a = ti->cpt_tls[i] & 0xFFFFFFFF; + tsk->thread.tls_array[i].b = ti->cpt_tls[i] >> 32; + } + + return 0; +} + +static int rst_restore_fpustate(struct task_struct *tsk, struct cpt_task_image *ti, + struct cpt_context *ctx) +{ + struct cpt_obj_bits hdr; + int err; + char *buf; + + clear_stopped_child_used_math(tsk); + + err = rst_get_object(CPT_OBJ_BITS, &hdr, sizeof(hdr), ctx); + if (err < 0) + return err; + + buf = kmalloc(hdr.cpt_size, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + err = ctx->read(buf, hdr.cpt_size, ctx); + if (err) + goto out; + + if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE && cpu_has_fxsr) { + memcpy(&tsk->thread.xstate, buf, + sizeof(struct i387_fxsave_struct)); + if (ti->cpt_flags & CPT_PF_USED_MATH) + set_stopped_child_used_math(tsk); + } +#ifndef CONFIG_X86_64 + else if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE_OLD && + !cpu_has_fxsr) { + memcpy(&tsk->thread.xstate, buf, + sizeof(struct i387_fsave_struct)); + if (ti->cpt_flags & CPT_PF_USED_MATH) + set_stopped_child_used_math(tsk); + } +#endif + +out: + kfree(buf); + return err; +} + +static u32 decode_segment(u32 segid) +{ + if (segid == CPT_SEG_ZERO) + return 0; + + /* TLS descriptors */ + if (segid <= CPT_SEG_TLS3) + return ((GDT_ENTRY_TLS_MIN + segid - CPT_SEG_TLS1) << 3) + 3; + + /* LDT descriptor, it is just an index to LDT array */ + if (segid >= CPT_SEG_LDT) + return ((segid - CPT_SEG_LDT) << 3) | 7; + + /* Check for one of standard descriptors */ + if (segid == CPT_SEG_USER32_DS) + return __USER_DS; + if (segid == CPT_SEG_USER32_CS) + return __USER_CS; + + eprintk("Invalid segment reg %d\n", segid); + return 0; +} + +static int rst_restore_registers(struct task_struct *tsk, struct cpt_context *ctx) +{ + struct cpt_x86_regs ri; + struct pt_regs *regs = task_pt_regs(tsk); + extern char i386_ret_from_resume; + int err; + + err = rst_get_object(CPT_OBJ_X86_REGS, &ri, sizeof(ri), ctx); + if (err < 0) + return err; + + tsk->thread.sp = (unsigned long) regs; + tsk->thread.sp0 = (unsigned long) (regs+1); + tsk->thread.ip = (unsigned long) &i386_ret_from_resume; + + tsk->thread.gs = decode_segment(ri.cpt_gs); + tsk->thread.debugreg0 = ri.cpt_debugreg[0]; + tsk->thread.debugreg1 = ri.cpt_debugreg[1]; + tsk->thread.debugreg2 = ri.cpt_debugreg[2]; + tsk->thread.debugreg3 = ri.cpt_debugreg[3]; + tsk->thread.debugreg6 = ri.cpt_debugreg[6]; + tsk->thread.debugreg7 = ri.cpt_debugreg[7]; + + regs->bx = ri.cpt_bx; + regs->cx = ri.cpt_cx; + regs->dx = ri.cpt_dx; + regs->si = ri.cpt_si; + regs->di = ri.cpt_di; + regs->bp = ri.cpt_bp; + regs->ax = ri.cpt_ax; + regs->orig_ax = ri.cpt_orig_ax; + regs->ip = ri.cpt_ip; + regs->flags = ri.cpt_flags; + regs->sp = ri.cpt_sp; + + regs->cs = decode_segment(ri.cpt_cs); + regs->ss = decode_segment(ri.cpt_ss); + regs->ds = decode_segment(ri.cpt_ds); + regs->es = decode_segment(ri.cpt_es); + regs->fs = decode_segment(ri.cpt_fs); + + tsk->thread.sp -= HOOK_RESERVE; + memset((void*)tsk->thread.sp, 0, HOOK_RESERVE); + + return 0; +} + +static int restart_thread(void *arg) +{ + struct thr_context *thr_ctx = arg; + struct cpt_context *ctx; + struct cpt_task_image *ti; + int err; + + current->state = TASK_UNINTERRUPTIBLE; + + ctx = thr_ctx->ctx; + ti = kmalloc(sizeof(*ti), GFP_KERNEL); + if (!ti) + return -ENOMEM; + + err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx); + if (!err) + err = rst_restore_task_struct(current, ti, ctx); + /* Restore mm here */ + if (!err) + err = rst_restore_fpustate(current, ti, ctx); + if (!err) + err = rst_restore_registers(current, ctx); + + thr_ctx->error = err; + complete(&thr_ctx->complete); + + if (!err && (ti->cpt_state & (EXIT_ZOMBIE|EXIT_DEAD))) { + do_exit(ti->cpt_exit_code); + } else { + __set_current_state(TASK_UNINTERRUPTIBLE); + } + + kfree(ti); + schedule(); + + eprintk("leaked %d/%d %p\n", task_pid_nr(current), task_pid_vnr(current), current->mm); + + module_put(THIS_MODULE); + complete_and_exit(NULL, 0); + return 0; +} +static int create_root_task(struct cpt_context *ctx, + struct thr_context *thr_ctx) +{ + struct task_struct *tsk; + int pid; + + thr_ctx->ctx = ctx; + thr_ctx->error = 0; + init_completion(&thr_ctx->complete); + + /* We should also create container here */ + pid = local_kernel_thread(restart_thread, thr_ctx, + CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | + CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET, 0); + if (pid < 0) + return pid; + read_lock(&tasklist_lock); + tsk = find_task_by_vpid(pid); + if (tsk) + get_task_struct(tsk); + read_unlock(&tasklist_lock); + if (tsk == NULL) + return -ESRCH; + thr_ctx->tsk = tsk; + return 0; +} + +int rst_restart_process(struct cpt_context *ctx) +{ + struct thr_context thr_ctx_root; + int err; + + err = create_root_task(ctx, &thr_ctx_root); + if (err) + return err; + + wait_for_completion(&thr_ctx_root.complete); + wait_task_inactive(thr_ctx_root.tsk, 0); + + return err; +} diff --git a/kernel/sched.c b/kernel/sched.c index 04160d2..94a23e5 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1970,6 +1970,7 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state) return ncsw; } +EXPORT_SYMBOL(wait_task_inactive); /*** * kick_process - kick a running thread to enter/exit the kernel -- 1.5.6 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH 9/9] Introduce functions to restore mm 2008-09-03 10:57 ` [PATCH 8/9] Introduce functions to restart a process Andrey Mirkin @ 2008-09-03 10:57 ` Andrey Mirkin 2008-09-03 14:32 ` [PATCH 8/9] Introduce functions to restart a process Louis Rilling 1 sibling, 0 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 10:57 UTC (permalink / raw) To: linux-kernel; +Cc: containers, Andrey Mirkin Functions to restore mm, VMAs and mm context are added. Signed-off-by: Andrey Mirkin <major@openvz.org> --- cpt/Makefile | 2 +- cpt/cpt.h | 1 + cpt/cpt_image.h | 5 + cpt/rst_mm.c | 320 +++++++++++++++++++++++++++++++++++++++++++++++++++++ cpt/rst_process.c | 3 +- mm/mmap.c | 1 + mm/mprotect.c | 2 + 8 files changed, 336 insertions(+), 2 deletions(-) create mode 100644 cpt/rst_mm.c diff --git a/cpt/Makefile b/cpt/Makefile index 689a0eb..19ca732 100644 --- a/cpt/Makefile +++ b/cpt/Makefile @@ -3,4 +3,4 @@ obj-y += sys_core.o obj-$(CONFIG_CHECKPOINT) += cptrst.o cptrst-objs := sys.o checkpoint.o cpt_process.o cpt_mm.o restart.o \ - rst_process.o + rst_process.o rst_mm.o diff --git a/cpt/cpt.h b/cpt/cpt.h index d59255f..f2a1b28 100644 --- a/cpt/cpt.h +++ b/cpt/cpt.h @@ -62,3 +62,4 @@ int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx); int restart_container(struct cpt_context *ctx); int rst_get_object(int type, void *tmp, int size, struct cpt_context *ctx); int rst_restart_process(struct cpt_context *ctx); +int rst_restore_mm(struct cpt_context *ctx); diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h index ae019e7..3b5b418 100644 --- a/cpt/cpt_image.h +++ b/cpt/cpt_image.h @@ -233,6 +233,11 @@ struct cpt_x86_regs __u32 cpt_ss; } __attribute__ ((aligned (8))); +static inline void __user * cpt_ptr_import(__u64 ptr) +{ + return (void*)(unsigned long)ptr; +} + static inline __u64 cpt_timespec_export(struct timespec *tv) { return (((u64)tv->tv_sec) << 32) + tv->tv_nsec; diff --git a/cpt/rst_mm.c b/cpt/rst_mm.c new file mode 100644 index 0000000..cccbff6 --- /dev/null +++ b/cpt/rst_mm.c @@ -0,0 +1,320 @@ +/* + * Copyright (C) 2008 Parallels, Inc. + * + * Author: Andrey Mirkin <major@openvz.org> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation, version 2 of the + * License. + * + */ + +#include <linux/sched.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/version.h> +#include <linux/module.h> +#include <linux/mm.h> +#include <linux/mman.h> +#include <linux/highmem.h> +#include <linux/pagemap.h> +#include <linux/vmalloc.h> +#include <linux/syscalls.h> + +#include "cpt.h" +#include "cpt_image.h" + +static unsigned long make_prot(struct cpt_vma_image *vmai) +{ + unsigned long prot = 0; + + if (vmai->cpt_flags & VM_READ) + prot |= PROT_READ; + if (vmai->cpt_flags & VM_WRITE) + prot |= PROT_WRITE; + if (vmai->cpt_flags & VM_EXEC) + prot |= PROT_EXEC; + if (vmai->cpt_flags & VM_GROWSDOWN) + prot |= PROT_GROWSDOWN; + if (vmai->cpt_flags & VM_GROWSUP) + prot |= PROT_GROWSUP; + return prot; +} + +static unsigned long make_flags(struct cpt_vma_image *vmai) +{ + unsigned long flags = MAP_FIXED; + + if (vmai->cpt_flags&(VM_SHARED|VM_MAYSHARE)) + flags |= MAP_SHARED; + else + flags |= MAP_PRIVATE; + + if (vmai->cpt_file == CPT_NULL) + flags |= MAP_ANONYMOUS; + if (vmai->cpt_flags & VM_GROWSDOWN) + flags |= MAP_GROWSDOWN; +#ifdef MAP_GROWSUP + if (vmai->cpt_flags & VM_GROWSUP) + flags |= MAP_GROWSUP; +#endif + if (vmai->cpt_flags & VM_DENYWRITE) + flags |= MAP_DENYWRITE; + if (vmai->cpt_flags & VM_EXECUTABLE) + flags |= MAP_EXECUTABLE; + if (!(vmai->cpt_flags & VM_ACCOUNT)) + flags |= MAP_NORESERVE; + return flags; +} + +static int rst_restore_one_vma(struct cpt_context *ctx) +{ + int err; + int i; + unsigned long addr; + struct mm_struct *mm = current->mm; + struct cpt_vma_image vmai; + struct vm_area_struct *vma; + struct file *file = NULL; + unsigned long prot; + + err = rst_get_object(CPT_OBJ_VMA, &vmai, sizeof(vmai), ctx); + if (err) + return err; + + prot = make_prot(&vmai); + + if (vmai.cpt_vma_type == CPT_VMA_FILE) { + struct cpt_object_hdr h; + int len; + char *path; + + err = rst_get_object(CPT_OBJ_NAME, &h, sizeof(h), ctx); + if (err) + goto out; + len = h.cpt_len - sizeof(h); + if (len < 0) { + err = -EINVAL; + goto out; + } + path = kmalloc(len, GFP_KERNEL); + if (!path) { + err = -ENOMEM; + goto out; + } + err = ctx->read(path, len, ctx); + if (err) { + kfree(path); + goto out; + } + + /* Just open file + TODO: open with correct flags */ + file = filp_open(path, O_RDONLY, 0); + kfree(path); + if (IS_ERR(file)) { + err = PTR_ERR(file); + goto out; + } + } + + down_write(&mm->mmap_sem); + addr = do_mmap_pgoff(file, vmai.cpt_start, + vmai.cpt_end - vmai.cpt_start, + prot, make_flags(&vmai), + vmai.cpt_pgoff); + + if (addr != vmai.cpt_start) { + up_write(&mm->mmap_sem); + + err = -EINVAL; + if (IS_ERR((void*)addr)) + err = addr; + goto out; + } + + vma = find_vma(mm, vmai.cpt_start); + if (vma == NULL) { + up_write(&mm->mmap_sem); + eprintk("cannot find mmapped vma\n"); + err = -ESRCH; + goto out; + } + + /* do_mmap_pgoff() can merge new area to previous one (not to the next, + * we mmap in order, the rest of mm is still unmapped). This can happen + * f.e. if flags are to be adjusted later, or if we had different + * anon_vma on two adjacent regions. Split it by brute force. */ + if (vma->vm_start != vmai.cpt_start) { + err = split_vma(mm, vma, (unsigned long)vmai.cpt_start, 0); + if (err) { + up_write(&mm->mmap_sem); + eprintk("cannot split vma\n"); + goto out; + } + } + up_write(&mm->mmap_sem); + + for (i = 0; i < vmai.cpt_page_num; i++) { + struct cpt_page_block pb; + + err = rst_get_object(CPT_OBJ_PAGES, &pb, sizeof(pb), ctx); + if (err) + goto out; + if (!(vmai.cpt_flags & VM_ACCOUNT) && !(prot & PROT_WRITE)) { + /* I guess this is get_user_pages() messed things, + * this happens f.e. when gdb inserts breakpoints. + */ + int j; + for (j = 0; j < (pb.cpt_end-pb.cpt_start)/PAGE_SIZE; j++) { + struct page *page; + void *maddr; + err = get_user_pages(current, current->mm, + (unsigned long)pb.cpt_start + + j * PAGE_SIZE, + 1, 1, 1, &page, NULL); + if (err == 0) + err = -EFAULT; + if (err < 0) { + eprintk("get_user_pages: %d\n", err); + goto out; + } + err = 0; + maddr = kmap(page); + if (pb.cpt_content == CPT_CONTENT_VOID) { + memset(maddr, 0, PAGE_SIZE); + } else if (pb.cpt_content == CPT_CONTENT_DATA) { + err = ctx->read(maddr, PAGE_SIZE, ctx); + if (err) { + kunmap(page); + goto out; + } + } else { + err = -EINVAL; + kunmap(page); + goto out; + } + set_page_dirty_lock(page); + kunmap(page); + page_cache_release(page); + } + } else { + if (!(prot & PROT_WRITE)) + sys_mprotect(vmai.cpt_start, + vmai.cpt_end - vmai.cpt_start, + prot | PROT_WRITE); + if (pb.cpt_content == CPT_CONTENT_VOID) { + int j; + for (j=0; j<(pb.cpt_end-pb.cpt_start)/sizeof(unsigned long); j++) { + err = __put_user(0UL, ((unsigned long __user*)(unsigned long)pb.cpt_start) + j); + if (err) { + eprintk("__put_user 2 %d\n", err); + goto out; + } + } + } else if (pb.cpt_content == CPT_CONTENT_DATA) { + err = ctx->read(cpt_ptr_import(pb.cpt_start), + pb.cpt_end - pb.cpt_start, + ctx); + if (err) + goto out; + } else { + err = -EINVAL; + goto out; + } + if (!(prot & PROT_WRITE)) + sys_mprotect(vmai.cpt_start, + vmai.cpt_end - vmai.cpt_start, + prot); + } + } + +out: + if (file) + fput(file); + return err; +} + +static int rst_restore_mm_context(struct cpt_context *ctx) +{ + struct cpt_obj_bits b; + struct mm_struct *mm = current->mm; + int oldsize = mm->context.size; + int err; + void *oldldt; + void *newldt; + + err = rst_get_object(CPT_OBJ_BITS, &b, sizeof(b), ctx); + if (err) + return err; + + if (b.cpt_size > PAGE_SIZE) + newldt = vmalloc(b.cpt_size); + else + newldt = kmalloc(b.cpt_size, GFP_KERNEL); + + if (!newldt) + return -ENOMEM; + + err = ctx->read(newldt, b.cpt_size, ctx); + if (err) + return err; + + oldldt = mm->context.ldt; + mm->context.ldt = newldt; + mm->context.size = b.cpt_size / LDT_ENTRY_SIZE; + + load_LDT(&mm->context); + + if (oldsize) { + if (oldsize * LDT_ENTRY_SIZE > PAGE_SIZE) + vfree(oldldt); + else + kfree(oldldt); + } + + return 0; +} + +int rst_restore_mm(struct cpt_context *ctx) +{ + int err; + int i; + struct mm_struct *mm = current->mm; + struct cpt_mm_image m; + + err = rst_get_object(CPT_OBJ_MM, &m, sizeof(m), ctx); + if (err) + return err; + + down_write(&mm->mmap_sem); + do_munmap(mm, 0, TASK_SIZE); + + mm->start_code = m.cpt_start_code; + mm->end_code = m.cpt_end_code; + mm->start_data = m.cpt_start_data; + mm->end_data = m.cpt_end_data; + mm->start_brk = m.cpt_start_brk; + mm->brk = m.cpt_brk; + mm->start_stack = m.cpt_start_stack; + mm->arg_start = m.cpt_start_arg; + mm->arg_end = m.cpt_end_arg; + mm->env_start = m.cpt_start_env; + mm->env_end = m.cpt_end_env; + mm->def_flags = m.cpt_def_flags; + mm->flags = m.cpt_flags; + + up_write(&mm->mmap_sem); + + for (i = 0; i < m.cpt_map_count; i++) { + err = rst_restore_one_vma(ctx); + if (err < 0) + goto out; + } + + err = rst_restore_mm_context(ctx); +out: + return err; +} + diff --git a/cpt/rst_process.c b/cpt/rst_process.c index 6d47f3c..84be46e 100644 --- a/cpt/rst_process.c +++ b/cpt/rst_process.c @@ -210,7 +210,8 @@ static int restart_thread(void *arg) err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx); if (!err) err = rst_restore_task_struct(current, ti, ctx); - /* Restore mm here */ + if (!err) + err = rst_restore_mm(ctx); if (!err) err = rst_restore_fpustate(current, ti, ctx); if (!err) diff --git a/mm/mmap.c b/mm/mmap.c index 971d0ed..98d1ba9 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1858,6 +1858,7 @@ int split_vma(struct mm_struct * mm, struct vm_area_struct * vma, return 0; } +EXPORT_SYMBOL(split_vma); /* Munmap is split into 2 main parts -- this part which finds * what needs doing, and the areas themselves, which do the diff --git a/mm/mprotect.c b/mm/mprotect.c index fded06f..47c7d75 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -22,6 +22,7 @@ #include <linux/swap.h> #include <linux/swapops.h> #include <linux/mmu_notifier.h> +#include <linux/module.h> #include <asm/uaccess.h> #include <asm/pgtable.h> #include <asm/cacheflush.h> @@ -317,3 +318,4 @@ out: up_write(¤t->mm->mmap_sem); return error; } +EXPORT_SYMBOL(sys_mprotect); -- 1.5.6 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH 8/9] Introduce functions to restart a process 2008-09-03 10:57 ` [PATCH 8/9] Introduce functions to restart a process Andrey Mirkin 2008-09-03 10:57 ` [PATCH 9/9] Introduce functions to restore mm Andrey Mirkin @ 2008-09-03 14:32 ` Louis Rilling 2008-09-13 17:34 ` Pavel Machek 1 sibling, 1 reply; 69+ messages in thread From: Louis Rilling @ 2008-09-03 14:32 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers [-- Attachment #1: Type: text/plain, Size: 10644 bytes --] On Wed, Sep 03, 2008 at 02:57:55PM +0400, Andrey Mirkin wrote: > Functions to restart process, restore its state, fpu and registers are added. > > Signed-off-by: Andrey Mirkin <major@openvz.org> > --- > arch/x86/kernel/entry_32.S | 21 +++ > arch/x86/kernel/process_32.c | 3 + > cpt/Makefile | 3 +- > cpt/cpt.h | 2 + > cpt/restart.c | 2 +- > cpt/rst_process.c | 277 ++++++++++++++++++++++++++++++++++++++++++ > kernel/sched.c | 1 + > 7 files changed, 307 insertions(+), 2 deletions(-) > create mode 100644 cpt/rst_process.c > > diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S > index 109792b..a4848a3 100644 > --- a/arch/x86/kernel/entry_32.S > +++ b/arch/x86/kernel/entry_32.S > @@ -225,6 +225,7 @@ ENTRY(ret_from_fork) > GET_THREAD_INFO(%ebp) > popl %eax > CFI_ADJUST_CFA_OFFSET -4 > +ret_from_fork_tail: > pushl $0x0202 # Reset kernel eflags > CFI_ADJUST_CFA_OFFSET 4 > popfl > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork) > CFI_ENDPROC > END(ret_from_fork) > > +ENTRY(i386_ret_from_resume) > + CFI_STARTPROC > + pushl %eax > + CFI_ADJUST_CFA_OFFSET 4 > + call schedule_tail > + GET_THREAD_INFO(%ebp) > + popl %eax > + CFI_ADJUST_CFA_OFFSET -4 > + movl (%esp), %eax > + testl %eax, %eax > + jz 1f > + pushl %esp > + call *%eax > + addl $4, %esp > +1: > + addl $256, %esp This space reserved for hooks deserves some explanations IMO. I partly know the reasons, since I have looked at OpenVZ code with you, but I'm sure that almost nobody knows... Btw, the constant HOOK_RESERVE should be defined in a common place for assembly and C as well. > + jmp ret_from_fork_tail > + CFI_ENDPROC > +END(i386_ret_from_resume) > + > /* > * Return to user mode is not as complex as all this looks, > * but we want the default path for a system call return to [...] > diff --git a/cpt/rst_process.c b/cpt/rst_process.c > new file mode 100644 > index 0000000..6d47f3c > --- /dev/null > +++ b/cpt/rst_process.c > @@ -0,0 +1,277 @@ > +/* > + * Copyright (C) 2008 Parallels, Inc. > + * > + * Author: Andrey Mirkin <major@openvz.org> > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public License as > + * published by the Free Software Foundation, version 2 of the > + * License. > + * > + */ > + > +#include <linux/sched.h> > +#include <linux/fs.h> > +#include <linux/file.h> > +#include <linux/version.h> > +#include <linux/module.h> > + > +#include "cpt.h" > +#include "cpt_image.h" > + > +#define HOOK_RESERVE 256 > + > +struct thr_context { > + struct completion complete; > + int error; > + struct cpt_context *ctx; > + struct task_struct *tsk; > +}; > + > +int local_kernel_thread(int (*fn)(void *), void * arg, unsigned long flags, pid_t pid) > +{ > + pid_t ret; > + > + if (current->fs == NULL) { > + /* do_fork_pid() hates processes without fs, oopses. */ > + eprintk("local_kernel_thread: current->fs==NULL\n"); > + return -EINVAL; > + } > + if (!try_module_get(THIS_MODULE)) > + return -EBUSY; Doing try_module_get(THIS_MODULE) and checking its result is almost always the symptom of a BUG. I explain myself: If it is possible that THIS_MODULE is currently unloading, its memory (especially its text section) can be freed at any time during the call to local_kernel_thread(), which leads to a beautiful OOPS. So, either the code is buggy, or this check is useless. I agree however that the module ref count must be incremented, since the new thread will run code from it. > + ret = kernel_thread(fn, arg, flags); > + if (ret < 0) > + module_put(THIS_MODULE); > + return ret; > +} > + > +static unsigned int decode_task_flags(unsigned int task_flags) > +{ > + unsigned int flags = 0; > + > + if (task_flags & (1 << CPT_PF_EXITING)) > + flags |= PF_EXITING; > + if (task_flags & (1 << CPT_PF_FORKNOEXEC)) > + flags |= PF_FORKNOEXEC; > + if (task_flags & (1 << CPT_PF_SUPERPRIV)) > + flags |= PF_SUPERPRIV; > + if (task_flags & (1 << CPT_PF_DUMPCORE)) > + flags |= PF_DUMPCORE; > + if (task_flags & (1 << CPT_PF_SIGNALED)) > + flags |= PF_SIGNALED; > + > + return flags; > + > +} > + > +int rst_restore_task_struct(struct task_struct *tsk, struct cpt_task_image *ti, > + struct cpt_context *ctx) > +{ > + int i; > + > + /* Restore only saved flags, comm and tls for now */ > + tsk->flags = decode_task_flags(ti->cpt_flags); > + clear_tsk_thread_flag(tsk, TIF_FREEZE); > + memcpy(tsk->comm, ti->cpt_comm, TASK_COMM_LEN); > + for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++) { > + tsk->thread.tls_array[i].a = ti->cpt_tls[i] & 0xFFFFFFFF; > + tsk->thread.tls_array[i].b = ti->cpt_tls[i] >> 32; > + } > + > + return 0; > +} > + > +static int rst_restore_fpustate(struct task_struct *tsk, struct cpt_task_image *ti, > + struct cpt_context *ctx) > +{ > + struct cpt_obj_bits hdr; > + int err; > + char *buf; > + > + clear_stopped_child_used_math(tsk); > + > + err = rst_get_object(CPT_OBJ_BITS, &hdr, sizeof(hdr), ctx); > + if (err < 0) > + return err; > + > + buf = kmalloc(hdr.cpt_size, GFP_KERNEL); > + if (!buf) > + return -ENOMEM; > + > + err = ctx->read(buf, hdr.cpt_size, ctx); > + if (err) > + goto out; > + > + if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE && cpu_has_fxsr) { > + memcpy(&tsk->thread.xstate, buf, > + sizeof(struct i387_fxsave_struct)); > + if (ti->cpt_flags & CPT_PF_USED_MATH) > + set_stopped_child_used_math(tsk); > + } > +#ifndef CONFIG_X86_64 > + else if (hdr.cpt_content == CPT_CONTENT_X86_FPUSTATE_OLD && > + !cpu_has_fxsr) { > + memcpy(&tsk->thread.xstate, buf, > + sizeof(struct i387_fsave_struct)); > + if (ti->cpt_flags & CPT_PF_USED_MATH) > + set_stopped_child_used_math(tsk); > + } > +#endif > + > +out: > + kfree(buf); > + return err; > +} > + > +static u32 decode_segment(u32 segid) > +{ > + if (segid == CPT_SEG_ZERO) > + return 0; > + > + /* TLS descriptors */ > + if (segid <= CPT_SEG_TLS3) > + return ((GDT_ENTRY_TLS_MIN + segid - CPT_SEG_TLS1) << 3) + 3; > + > + /* LDT descriptor, it is just an index to LDT array */ > + if (segid >= CPT_SEG_LDT) > + return ((segid - CPT_SEG_LDT) << 3) | 7; > + > + /* Check for one of standard descriptors */ > + if (segid == CPT_SEG_USER32_DS) > + return __USER_DS; > + if (segid == CPT_SEG_USER32_CS) > + return __USER_CS; > + > + eprintk("Invalid segment reg %d\n", segid); > + return 0; > +} > + > +static int rst_restore_registers(struct task_struct *tsk, struct cpt_context *ctx) > +{ > + struct cpt_x86_regs ri; > + struct pt_regs *regs = task_pt_regs(tsk); > + extern char i386_ret_from_resume; > + int err; > + > + err = rst_get_object(CPT_OBJ_X86_REGS, &ri, sizeof(ri), ctx); > + if (err < 0) > + return err; > + > + tsk->thread.sp = (unsigned long) regs; > + tsk->thread.sp0 = (unsigned long) (regs+1); > + tsk->thread.ip = (unsigned long) &i386_ret_from_resume; > + > + tsk->thread.gs = decode_segment(ri.cpt_gs); > + tsk->thread.debugreg0 = ri.cpt_debugreg[0]; > + tsk->thread.debugreg1 = ri.cpt_debugreg[1]; > + tsk->thread.debugreg2 = ri.cpt_debugreg[2]; > + tsk->thread.debugreg3 = ri.cpt_debugreg[3]; > + tsk->thread.debugreg6 = ri.cpt_debugreg[6]; > + tsk->thread.debugreg7 = ri.cpt_debugreg[7]; > + > + regs->bx = ri.cpt_bx; > + regs->cx = ri.cpt_cx; > + regs->dx = ri.cpt_dx; > + regs->si = ri.cpt_si; > + regs->di = ri.cpt_di; > + regs->bp = ri.cpt_bp; > + regs->ax = ri.cpt_ax; > + regs->orig_ax = ri.cpt_orig_ax; > + regs->ip = ri.cpt_ip; > + regs->flags = ri.cpt_flags; > + regs->sp = ri.cpt_sp; > + > + regs->cs = decode_segment(ri.cpt_cs); > + regs->ss = decode_segment(ri.cpt_ss); > + regs->ds = decode_segment(ri.cpt_ds); > + regs->es = decode_segment(ri.cpt_es); > + regs->fs = decode_segment(ri.cpt_fs); > + > + tsk->thread.sp -= HOOK_RESERVE; > + memset((void*)tsk->thread.sp, 0, HOOK_RESERVE); > + > + return 0; > +} > + > +static int restart_thread(void *arg) > +{ > + struct thr_context *thr_ctx = arg; > + struct cpt_context *ctx; > + struct cpt_task_image *ti; > + int err; > + > + current->state = TASK_UNINTERRUPTIBLE; > + > + ctx = thr_ctx->ctx; > + ti = kmalloc(sizeof(*ti), GFP_KERNEL); > + if (!ti) > + return -ENOMEM; > + > + err = rst_get_object(CPT_OBJ_TASK, ti, sizeof(*ti), ctx); > + if (!err) > + err = rst_restore_task_struct(current, ti, ctx); > + /* Restore mm here */ > + if (!err) > + err = rst_restore_fpustate(current, ti, ctx); > + if (!err) > + err = rst_restore_registers(current, ctx); > + > + thr_ctx->error = err; > + complete(&thr_ctx->complete); > + > + if (!err && (ti->cpt_state & (EXIT_ZOMBIE|EXIT_DEAD))) { > + do_exit(ti->cpt_exit_code); > + } else { > + __set_current_state(TASK_UNINTERRUPTIBLE); > + } > + > + kfree(ti); > + schedule(); > + > + eprintk("leaked %d/%d %p\n", task_pid_nr(current), task_pid_vnr(current), current->mm); > + > + module_put(THIS_MODULE); Where is the module ref count dropped in case of success? > + complete_and_exit(NULL, 0); > + return 0; > +} > +static int create_root_task(struct cpt_context *ctx, > + struct thr_context *thr_ctx) > +{ > + struct task_struct *tsk; > + int pid; > + > + thr_ctx->ctx = ctx; > + thr_ctx->error = 0; > + init_completion(&thr_ctx->complete); > + > + /* We should also create container here */ > + pid = local_kernel_thread(restart_thread, thr_ctx, > + CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | > + CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET, 0); > + if (pid < 0) > + return pid; > + read_lock(&tasklist_lock); > + tsk = find_task_by_vpid(pid); > + if (tsk) > + get_task_struct(tsk); > + read_unlock(&tasklist_lock); > + if (tsk == NULL) > + return -ESRCH; > + thr_ctx->tsk = tsk; > + return 0; > +} > + > +int rst_restart_process(struct cpt_context *ctx) > +{ > + struct thr_context thr_ctx_root; > + int err; > + > + err = create_root_task(ctx, &thr_ctx_root); > + if (err) > + return err; > + > + wait_for_completion(&thr_ctx_root.complete); > + wait_task_inactive(thr_ctx_root.tsk, 0); I don't see the point of this two-steps completion. Could you explain? Thanks, Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 8/9] Introduce functions to restart a process 2008-09-03 14:32 ` [PATCH 8/9] Introduce functions to restart a process Louis Rilling @ 2008-09-13 17:34 ` Pavel Machek 0 siblings, 0 replies; 69+ messages in thread From: Pavel Machek @ 2008-09-13 17:34 UTC (permalink / raw) To: Louis Rilling; +Cc: Andrey Mirkin, linux-kernel, containers Hi! > > @@ -233,6 +234,26 @@ ENTRY(ret_from_fork) > > CFI_ENDPROC > > END(ret_from_fork) > > > > +ENTRY(i386_ret_from_resume) > > + CFI_STARTPROC > > + pushl %eax > > + CFI_ADJUST_CFA_OFFSET 4 > > + call schedule_tail > > + GET_THREAD_INFO(%ebp) > > + popl %eax > > + CFI_ADJUST_CFA_OFFSET -4 > > + movl (%esp), %eax > > + testl %eax, %eax > > + jz 1f > > + pushl %esp > > + call *%eax > > + addl $4, %esp > > +1: > > + addl $256, %esp Can we do something with naming here? resume usually means 'resume from s2ram'... > > diff --git a/cpt/rst_process.c b/cpt/rst_process.c > > new file mode 100644 > > index 0000000..6d47f3c > > --- /dev/null > > +++ b/cpt/rst_process.c > > @@ -0,0 +1,277 @@ ....and you are not even consistent. Plus, rst_ will probably be understood as a reset by most people. Hmmm... your syscalls get fd. What happens if I pass something like /proc/self/maps to them? -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 6/9] Introduce functions to dump mm 2008-09-03 10:57 ` [PATCH 6/9] Introduce functions to dump mm Andrey Mirkin 2008-09-03 10:57 ` [PATCH 7/9] Introduce function for restarting a container Andrey Mirkin @ 2008-09-03 14:17 ` Louis Rilling 1 sibling, 0 replies; 69+ messages in thread From: Louis Rilling @ 2008-09-03 14:17 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers [-- Attachment #1: Type: text/plain, Size: 11734 bytes --] On Wed, Sep 03, 2008 at 02:57:53PM +0400, Andrey Mirkin wrote: > Functions to dump mm struct, VMAs and mm context are added. > [...] > diff --git a/cpt/cpt_mm.c b/cpt/cpt_mm.c > new file mode 100644 > index 0000000..6e025cc > --- /dev/null > +++ b/cpt/cpt_mm.c > @@ -0,0 +1,431 @@ > +/* > + * Copyright (C) 2008 Parallels, Inc. > + * > + * Authors: Andrey Mirkin <major@openvz.org> > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public License as > + * published by the Free Software Foundation, version 2 of the > + * License. > + * > + */ > + > +#include <linux/sched.h> > +#include <linux/slab.h> > +#include <linux/file.h> > +#include <linux/mm.h> > +#include <linux/errno.h> > +#include <linux/major.h> > +#include <linux/mman.h> > +#include <linux/mnt_namespace.h> > +#include <linux/mount.h> > +#include <linux/namei.h> > +#include <linux/pagemap.h> > +#include <linux/hugetlb.h> > +#include <asm/ldt.h> > + > +#include "cpt.h" > +#include "cpt_image.h" > + > +struct page_area > +{ > + int type; > + unsigned long start; > + unsigned long end; > + pgoff_t pgoff; > + loff_t mm; > + __u64 list[16]; > +}; > + > +struct page_desc > +{ > + int type; > + pgoff_t index; > + loff_t mm; > + int shared; > +}; > + > +enum { > + PD_ABSENT, > + PD_COPY, > + PD_FUNKEY, > +}; > + > +/* 0: page can be obtained from backstore, or still not mapped anonymous page, > + or something else, which does not requre copy. > + 1: page requires copy > + 2: page requres copy but its content is zero. Quite useless. > + 3: wp page is shared after fork(). It is to be COWed when modified. > + 4: page is something unsupported... We copy it right now. > + */ > + > +static void page_get_desc(struct vm_area_struct *vma, unsigned long addr, > + struct page_desc *pdesc, cpt_context_t * ctx) > +{ > + struct mm_struct *mm = vma->vm_mm; > + pgd_t *pgd; > + pud_t *pud; > + pmd_t *pmd; > + pte_t *ptep, pte; > + spinlock_t *ptl; > + struct page *pg = NULL; > + pgoff_t linear_index = (addr - vma->vm_start)/PAGE_SIZE + vma->vm_pgoff; > + > + pdesc->index = linear_index; > + pdesc->shared = 0; > + pdesc->mm = CPT_NULL; > + > + if (vma->vm_flags & VM_IO) { > + pdesc->type = PD_ABSENT; > + return; > + } > + > + pgd = pgd_offset(mm, addr); > + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd))) > + goto out_absent; > + pud = pud_offset(pgd, addr); > + if (pud_none(*pud) || unlikely(pud_bad(*pud))) > + goto out_absent; > + pmd = pmd_offset(pud, addr); > + if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd))) > + goto out_absent; > +#ifdef CONFIG_X86 > + if (pmd_huge(*pmd)) { > + eprintk("page_huge\n"); > + goto out_unsupported; > + } > +#endif > + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl); > + pte = *ptep; > + pte_unmap(ptep); > + > + if (pte_none(pte)) > + goto out_absent_unlock; > + > + if ((pg = vm_normal_page(vma, addr, pte)) == NULL) { > + pdesc->type = PD_COPY; > + goto out_unlock; > + } > + > + get_page(pg); > + spin_unlock(ptl); > + > + if (pg->mapping && !PageAnon(pg)) { > + if (vma->vm_file == NULL) { > + eprintk("pg->mapping!=NULL for fileless vma: %08lx\n", addr); > + goto out_unsupported; > + } > + if (vma->vm_file->f_mapping != pg->mapping) { > + eprintk("pg->mapping!=f_mapping: %08lx %p %p\n", > + addr, vma->vm_file->f_mapping, pg->mapping); > + goto out_unsupported; > + } > + pdesc->index = (pg->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT)); > + /* Page is in backstore. For us it is like > + * it is not present. > + */ > + goto out_absent; > + } > + > + if (PageReserved(pg)) { > + /* Special case: ZERO_PAGE is used, when an > + * anonymous page is accessed but not written. */ > + if (pg == ZERO_PAGE(addr)) { > + if (pte_write(pte)) { > + eprintk("not funny already, writable ZERO_PAGE\n"); > + goto out_unsupported; > + } > + goto out_absent; > + } > + eprintk("reserved page %lu at %08lx\n", pg->index, addr); > + goto out_unsupported; > + } > + > + if (!pg->mapping) { > + eprintk("page without mapping at %08lx\n", addr); > + goto out_unsupported; > + } > + > + pdesc->type = PD_COPY; > + > +out_put: > + if (pg) > + put_page(pg); > + return; > + > +out_unlock: > + spin_unlock(ptl); > + goto out_put; > + > +out_absent_unlock: > + spin_unlock(ptl); > + > +out_absent: > + pdesc->type = PD_ABSENT; > + goto out_put; > + > +out_unsupported: > + pdesc->type = PD_FUNKEY; > + goto out_put; > +} > + > +static int count_vma_pages(struct vm_area_struct *vma, struct cpt_context *ctx) > +{ > + unsigned long addr; > + int page_num = 0; > + > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { > + struct page_desc pd; > + > + page_get_desc(vma, addr, &pd, ctx); > + > + if (pd.type != PD_COPY) { > + return -EINVAL; > + } else { > + page_num += 1; > + } > + > + } > + return page_num; > +} > + > +/* ATTN: We give "current" to get_user_pages(). This is wrong, but get_user_pages() > + * does not really need this thing. It just stores some page fault stats there. > + * > + * BUG: some archs (f.e. sparc64, but not Intel*) require flush cache pages > + * before accessing vma. > + */ > +static int dump_pages(struct vm_area_struct *vma, unsigned long start, > + unsigned long end, struct cpt_context *ctx) > +{ > +#define MAX_PAGE_BATCH 16 > + struct page *pg[MAX_PAGE_BATCH]; > + int npages = (end - start)/PAGE_SIZE; > + int count = 0; > + > + while (count < npages) { > + int copy = npages - count; > + int n; > + > + if (copy > MAX_PAGE_BATCH) > + copy = MAX_PAGE_BATCH; > + n = get_user_pages(current, vma->vm_mm, start, copy, > + 0, 1, pg, NULL); > + if (n == copy) { > + int i; > + for (i=0; i<n; i++) { > + char *maddr = kmap(pg[i]); > + ctx->write(maddr, PAGE_SIZE, ctx); > + kunmap(pg[i]); > + } > + } else { > + eprintk("get_user_pages fault"); > + for ( ; n > 0; n--) > + page_cache_release(pg[n-1]); > + return -EFAULT; > + } > + start += n*PAGE_SIZE; > + count += n; > + for ( ; n > 0; n--) > + page_cache_release(pg[n-1]); > + } > + return 0; > +} > + > +static int dump_page_block(struct vm_area_struct *vma, > + struct cpt_page_block *pgb, > + struct cpt_context *ctx) > +{ > + int err; > + pgb->cpt_len = sizeof(*pgb) + pgb->cpt_end - pgb->cpt_start; > + pgb->cpt_type = CPT_OBJ_PAGES; > + pgb->cpt_hdrlen = sizeof(*pgb); > + pgb->cpt_content = CPT_CONTENT_DATA; > + > + err = ctx->write(pgb, sizeof(*pgb), ctx); > + if (!err) > + err = dump_pages(vma, pgb->cpt_start, pgb->cpt_end, ctx); > + > + return err; > +} > + > +static int cpt_dump_dentry(struct path *p, cpt_context_t *ctx) > +{ > + int len; > + char *path; > + char *buf; > + struct cpt_object_hdr o; > + > + buf = (char *)__get_free_page(GFP_KERNEL); > + if (!buf) > + return -ENOMEM; > + > + path = d_path(p, buf, PAGE_SIZE); > + > + if (IS_ERR(path)) { > + free_page((unsigned long)buf); > + return PTR_ERR(path); > + } > + > + len = buf + PAGE_SIZE - 1 - path; > + o.cpt_len = sizeof(o) + len + 1; > + o.cpt_type = CPT_OBJ_NAME; > + o.cpt_hdrlen = sizeof(o); > + o.cpt_content = CPT_CONTENT_NAME; > + path[len] = 0; > + > + ctx->write(&o, sizeof(o), ctx); > + ctx->write(path, len + 1, ctx); > + free_page((unsigned long)buf); > + > + return 0; > +} > + > +static int dump_one_vma(struct mm_struct *mm, > + struct vm_area_struct *vma, struct cpt_context *ctx) > +{ > + struct cpt_vma_image *v; > + unsigned long addr; > + int page_num; > + int err; > + > + v = kzalloc(sizeof(*v), GFP_KERNEL); > + if (!v) > + return -ENOMEM; > + > + v->cpt_len = sizeof(*v); > + v->cpt_type = CPT_OBJ_VMA; > + v->cpt_hdrlen = sizeof(*v); > + v->cpt_content = CPT_CONTENT_ARRAY; > + > + v->cpt_start = vma->vm_start; > + v->cpt_end = vma->vm_end; > + v->cpt_flags = vma->vm_flags; > + if (vma->vm_flags & VM_HUGETLB) { > + eprintk("huge TLB VMAs are still not supported\n"); > + kfree(v); > + return -EINVAL; > + } > + v->cpt_pgprot = vma->vm_page_prot.pgprot; > + v->cpt_pgoff = vma->vm_pgoff; > + v->cpt_file = CPT_NULL; > + v->cpt_vma_type = CPT_VMA_TYPE_0; > + > + page_num = count_vma_pages(vma, ctx); > + if (page_num < 0) { > + kfree(v); > + return -EINVAL; > + } > + v->cpt_page_num = page_num; > + > + if (vma->vm_file) { > + v->cpt_file = 0; > + v->cpt_vma_type = CPT_VMA_FILE; > + } > + > + ctx->write(v, sizeof(*v), ctx); > + kfree(v); > + > + if (vma->vm_file) { > + err = cpt_dump_dentry(&vma->vm_file->f_path, ctx); > + if (err < 0) > + return err; > + } > + > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) { > + struct page_desc pd; > + struct cpt_page_block pgb; > + > + page_get_desc(vma, addr, &pd, ctx); > + > + if (pd.type == PD_FUNKEY || pd.type == PD_ABSENT) { > + eprintk("dump_one_vma: funkey page\n"); > + return -EINVAL; IIUC, as soon as ZERO_PAGE is hit, checkpoint aborts. I doubt that many applications will be checkpointable with this limitation. > + } > + > + pgb.cpt_start = addr; > + pgb.cpt_end = addr + PAGE_SIZE; > + dump_page_block(vma, &pgb, ctx); > + } > + > + return 0; > +} > + > +static int cpt_dump_mm_context(struct mm_struct *mm, struct cpt_context *ctx) > +{ > +#ifdef CONFIG_X86 > + if (mm->context.size) { > + struct cpt_obj_bits b; > + int size; > + > + mutex_lock(&mm->context.lock); > + > + b.cpt_type = CPT_OBJ_BITS; > + b.cpt_len = sizeof(b); > + b.cpt_content = CPT_CONTENT_MM_CONTEXT; > + b.cpt_size = mm->context.size * LDT_ENTRY_SIZE; > + > + ctx->write(&b, sizeof(b), ctx); > + > + size = mm->context.size * LDT_ENTRY_SIZE; > + > + ctx->write(mm->context.ldt, size, ctx); > + > + mutex_unlock(&mm->context.lock); > + } > +#endif > + return 0; > +} > + > +int cpt_dump_mm(struct task_struct *tsk, struct cpt_context *ctx) > +{ > + struct mm_struct *mm = tsk->mm; > + struct cpt_mm_image *v; > + struct vm_area_struct *vma; > + int err; > + > + v = kzalloc(sizeof(*v), GFP_KERNEL); > + if (!v) > + return -ENOMEM; > + > + v->cpt_len = sizeof(*v); > + v->cpt_type = CPT_OBJ_MM; > + v->cpt_hdrlen = sizeof(*v); > + v->cpt_content = CPT_CONTENT_ARRAY; > + > + v->cpt_start_code = mm->start_code; > + v->cpt_end_code = mm->end_code; > + v->cpt_start_data = mm->start_data; > + v->cpt_end_data = mm->end_data; > + v->cpt_start_brk = mm->start_brk; > + v->cpt_brk = mm->brk; > + v->cpt_start_stack = mm->start_stack; > + v->cpt_start_arg = mm->arg_start; > + v->cpt_end_arg = mm->arg_end; > + v->cpt_start_env = mm->env_start; > + v->cpt_end_env = mm->env_end; > + v->cpt_def_flags = mm->def_flags; > + v->cpt_flags = mm->flags; > + v->cpt_map_count = mm->map_count; > + > + err = ctx->write(v, sizeof(*v), ctx); > + kfree(v); > + > + if (err) { > + eprintk("error during writing mm\n"); > + return err; > + } > + > + for (vma = mm->mmap; vma; vma = vma->vm_next) { > + int err; > + > + if ((err = dump_one_vma(mm, vma, ctx)) != 0) > + return err; > + } Why not down_reading mm->mmap_sem here, while it is properly down_written during restart? Is it because we can rely on the tasks being frozen during checkpoint, and not during restart? Thanks, Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/9] Introduce container dump function 2008-09-03 10:57 ` [PATCH 4/9] Introduce container dump function Andrey Mirkin 2008-09-03 10:57 ` [PATCH 5/9] Introduce function to dump process Andrey Mirkin @ 2008-09-03 14:23 ` Serge E. Hallyn 2008-09-03 14:45 ` Andrey Mirkin 1 sibling, 1 reply; 69+ messages in thread From: Serge E. Hallyn @ 2008-09-03 14:23 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers Quoting Andrey Mirkin (major@openvz.org): > Actually right now we are going to dump only one process. > Function for dumping head of image file are added. > > Signed-off-by: Andrey Mirkin <major@openvz.org> > --- > cpt/Makefile | 2 +- > cpt/checkpoint.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > cpt/cpt.h | 3 ++ > cpt/sys.c | 3 +- > kernel/fork.c | 2 + > 5 files changed, 82 insertions(+), 2 deletions(-) > create mode 100644 cpt/checkpoint.c > > diff --git a/cpt/Makefile b/cpt/Makefile > index bfe75d5..173346b 100644 > --- a/cpt/Makefile > +++ b/cpt/Makefile > @@ -2,4 +2,4 @@ obj-y += sys_core.o > > obj-$(CONFIG_CHECKPOINT) += cptrst.o > > -cptrst-objs := sys.o > +cptrst-objs := sys.o checkpoint.o > diff --git a/cpt/checkpoint.c b/cpt/checkpoint.c > new file mode 100644 > index 0000000..b4d9686 > --- /dev/null > +++ b/cpt/checkpoint.c > @@ -0,0 +1,74 @@ > +/* > + * Copyright (C) 2008 Parallels, Inc. > + * > + * Author: Andrey Mirkin <major@openvz.org> > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public License as > + * published by the Free Software Foundation, version 2 of the > + * License. > + * > + */ > + > +#include <linux/sched.h> > +#include <linux/fs.h> > +#include <linux/file.h> > +#include <linux/version.h> > + > +#include "cpt.h" > + > +static int cpt_write_head(struct cpt_context *ctx) > +{ > + struct cpt_head hdr; > + > + memset(&hdr, 0, sizeof(hdr)); > + hdr.cpt_signature[0] = CPT_SIGNATURE0; > + hdr.cpt_signature[1] = CPT_SIGNATURE1; > + hdr.cpt_signature[2] = CPT_SIGNATURE2; > + hdr.cpt_signature[3] = CPT_SIGNATURE3; > + hdr.cpt_hdrlen = sizeof(hdr); > + hdr.cpt_image_major = (LINUX_VERSION_CODE >> 16) & 0xff; > + hdr.cpt_image_minor = (LINUX_VERSION_CODE >> 8) & 0xff; > + hdr.cpt_image_sublevel = (LINUX_VERSION_CODE) & 0xff; > + hdr.cpt_image_extra = 0; > +#if defined(CONFIG_X86_32) > + hdr.cpt_arch = CPT_ARCH_I386; > +#else > +#error Arch is not supported > +#endif > + return ctx->write(&hdr, sizeof(hdr), ctx); > +} > + > +int dump_container(struct cpt_context *ctx) > +{ > + int err; > + struct task_struct *root; > + > + read_lock(&tasklist_lock); > + root = find_task_by_vpid(ctx->pid); > + if (root) > + get_task_struct(root); > + read_unlock(&tasklist_lock); > + > + err = -ESRCH; > + if (!root) { > + eprintk("can not find root task\n"); > + return err; > + } > + ctx->nsproxy = root->nsproxy; > + if (!ctx->nsproxy) { > + eprintk("nsproxy is null\n"); > + goto out; > + } The get_task_struct() above won't pin the tsk->nsproxy though, will it? So should you be doing a rcu_read_lock(); nsproxy = get_task_nsproxy(root); rcu_read_unlock(); to make sure the nsproxy doesn't disappear out from under you? (Maybe you do it in a later patch...) thanks, -serge > + err = cpt_write_head(ctx); > + > + /* Dump task here */ > + if (!err) > + err = -ENOSYS; > + > +out: > + ctx->nsproxy = NULL; > + put_task_struct(root); > + return err; > +} > diff --git a/cpt/cpt.h b/cpt/cpt.h > index 607ac1b..b421a11 100644 > --- a/cpt/cpt.h > +++ b/cpt/cpt.h > @@ -33,6 +33,7 @@ typedef struct cpt_context > int refcount; > int ctx_state; > struct semaphore main_sem; > + struct nsproxy *nsproxy; > > int errno; > > @@ -54,3 +55,5 @@ extern int debug_level; > > #define eprintk(a...) cpt_printk(1, "CPT ERR: " a) > #define dprintk(a...) cpt_printk(1, "CPT DBG: " a) > + > +int dump_container(struct cpt_context *ctx); > diff --git a/cpt/sys.c b/cpt/sys.c > index 8334c4c..6801c22 100644 > --- a/cpt/sys.c > +++ b/cpt/sys.c > @@ -109,9 +109,10 @@ static int checkpoint(pid_t pid, int fd, unsigned long flags) > > ctx->file = file; > ctx->ctx_state = CPT_CTX_DUMPING; > + ctx->pid = pid; > > /* checkpoint */ > - err = -ENOSYS; > + err = dump_container(ctx); > > context_put(ctx); > > diff --git a/kernel/fork.c b/kernel/fork.c > index 52b5037..f38b43d 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -77,6 +77,7 @@ int max_threads; /* tunable limit on nr_threads */ > DEFINE_PER_CPU(unsigned long, process_counts) = 0; > > __cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */ > +EXPORT_SYMBOL(tasklist_lock); > > int nr_processes(void) > { > @@ -153,6 +154,7 @@ void __put_task_struct(struct task_struct *tsk) > if (!profile_handoff_task(tsk)) > free_task(tsk); > } > +EXPORT_SYMBOL(__put_task_struct); > > /* > * macro override instead of weak attribute alias, to workaround > -- > 1.5.6 > > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 4/9] Introduce container dump function 2008-09-03 14:23 ` [PATCH 4/9] Introduce container dump function Serge E. Hallyn @ 2008-09-03 14:45 ` Andrey Mirkin 0 siblings, 0 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 14:45 UTC (permalink / raw) To: Serge E. Hallyn; +Cc: linux-kernel, containers On Wednesday 03 September 2008 18:23 Serge E. Hallyn wrote: > Quoting Andrey Mirkin (major@openvz.org): > > Actually right now we are going to dump only one process. > > Function for dumping head of image file are added. > > > > Signed-off-by: Andrey Mirkin <major@openvz.org> > > --- > > cpt/Makefile | 2 +- > > cpt/checkpoint.c | 74 > > ++++++++++++++++++++++++++++++++++++++++++++++++++++++ cpt/cpt.h | > > 3 ++ > > cpt/sys.c | 3 +- > > kernel/fork.c | 2 + > > 5 files changed, 82 insertions(+), 2 deletions(-) > > create mode 100644 cpt/checkpoint.c > > > > diff --git a/cpt/Makefile b/cpt/Makefile > > index bfe75d5..173346b 100644 > > --- a/cpt/Makefile > > +++ b/cpt/Makefile > > @@ -2,4 +2,4 @@ obj-y += sys_core.o > > > > obj-$(CONFIG_CHECKPOINT) += cptrst.o > > > > -cptrst-objs := sys.o > > +cptrst-objs := sys.o checkpoint.o > > diff --git a/cpt/checkpoint.c b/cpt/checkpoint.c > > new file mode 100644 > > index 0000000..b4d9686 > > --- /dev/null > > +++ b/cpt/checkpoint.c > > @@ -0,0 +1,74 @@ > > +/* > > + * Copyright (C) 2008 Parallels, Inc. > > + * > > + * Author: Andrey Mirkin <major@openvz.org> > > + * > > + * This program is free software; you can redistribute it and/or > > + * modify it under the terms of the GNU General Public License as > > + * published by the Free Software Foundation, version 2 of the > > + * License. > > + * > > + */ > > + > > +#include <linux/sched.h> > > +#include <linux/fs.h> > > +#include <linux/file.h> > > +#include <linux/version.h> > > + > > +#include "cpt.h" > > + > > +static int cpt_write_head(struct cpt_context *ctx) > > +{ > > + struct cpt_head hdr; > > + > > + memset(&hdr, 0, sizeof(hdr)); > > + hdr.cpt_signature[0] = CPT_SIGNATURE0; > > + hdr.cpt_signature[1] = CPT_SIGNATURE1; > > + hdr.cpt_signature[2] = CPT_SIGNATURE2; > > + hdr.cpt_signature[3] = CPT_SIGNATURE3; > > + hdr.cpt_hdrlen = sizeof(hdr); > > + hdr.cpt_image_major = (LINUX_VERSION_CODE >> 16) & 0xff; > > + hdr.cpt_image_minor = (LINUX_VERSION_CODE >> 8) & 0xff; > > + hdr.cpt_image_sublevel = (LINUX_VERSION_CODE) & 0xff; > > + hdr.cpt_image_extra = 0; > > +#if defined(CONFIG_X86_32) > > + hdr.cpt_arch = CPT_ARCH_I386; > > +#else > > +#error Arch is not supported > > +#endif > > + return ctx->write(&hdr, sizeof(hdr), ctx); > > +} > > + > > +int dump_container(struct cpt_context *ctx) > > +{ > > + int err; > > + struct task_struct *root; > > + > > + read_lock(&tasklist_lock); > > + root = find_task_by_vpid(ctx->pid); > > + if (root) > > + get_task_struct(root); > > + read_unlock(&tasklist_lock); > > + > > + err = -ESRCH; > > + if (!root) { > > + eprintk("can not find root task\n"); > > + return err; > > + } > > + ctx->nsproxy = root->nsproxy; > > + if (!ctx->nsproxy) { > > + eprintk("nsproxy is null\n"); > > + goto out; > > + } > > The get_task_struct() above won't pin the tsk->nsproxy > though, will it? So should you be doing a > rcu_read_lock(); > nsproxy = get_task_nsproxy(root); > rcu_read_unlock(); > to make sure the nsproxy doesn't disappear out from under > you? > You right here, will fix it in next version. Thanks, Andrey > > > + err = cpt_write_head(ctx); > > + > > + /* Dump task here */ > > + if (!err) > > + err = -ENOSYS; > > + > > +out: > > + ctx->nsproxy = NULL; > > + put_task_struct(root); > > + return err; > > +} > > diff --git a/cpt/cpt.h b/cpt/cpt.h > > index 607ac1b..b421a11 100644 > > --- a/cpt/cpt.h > > +++ b/cpt/cpt.h > > @@ -33,6 +33,7 @@ typedef struct cpt_context > > int refcount; > > int ctx_state; > > struct semaphore main_sem; > > + struct nsproxy *nsproxy; > > > > int errno; > > > > @@ -54,3 +55,5 @@ extern int debug_level; > > > > #define eprintk(a...) cpt_printk(1, "CPT ERR: " a) > > #define dprintk(a...) cpt_printk(1, "CPT DBG: " a) > > + > > +int dump_container(struct cpt_context *ctx); > > diff --git a/cpt/sys.c b/cpt/sys.c > > index 8334c4c..6801c22 100644 > > --- a/cpt/sys.c > > +++ b/cpt/sys.c > > @@ -109,9 +109,10 @@ static int checkpoint(pid_t pid, int fd, unsigned > > long flags) > > > > ctx->file = file; > > ctx->ctx_state = CPT_CTX_DUMPING; > > + ctx->pid = pid; > > > > /* checkpoint */ > > - err = -ENOSYS; > > + err = dump_container(ctx); > > > > context_put(ctx); > > > > diff --git a/kernel/fork.c b/kernel/fork.c > > index 52b5037..f38b43d 100644 > > --- a/kernel/fork.c > > +++ b/kernel/fork.c > > @@ -77,6 +77,7 @@ int max_threads; /* tunable limit on nr_threads */ > > DEFINE_PER_CPU(unsigned long, process_counts) = 0; > > > > __cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */ > > +EXPORT_SYMBOL(tasklist_lock); > > > > int nr_processes(void) > > { > > @@ -153,6 +154,7 @@ void __put_task_struct(struct task_struct *tsk) > > if (!profile_handoff_task(tsk)) > > free_task(tsk); > > } > > +EXPORT_SYMBOL(__put_task_struct); > > > > /* > > * macro override instead of weak attribute alias, to workaround > > -- > > 1.5.6 > > > > _______________________________________________ > > Containers mailing list > > Containers@lists.linux-foundation.org > > https://lists.linux-foundation.org/mailman/listinfo/containers > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 3/9] Introduce context structure needed during checkpointing/restart 2008-09-03 10:57 ` [PATCH 3/9] Introduce context structure needed during checkpointing/restart Andrey Mirkin 2008-09-03 10:57 ` [PATCH 4/9] Introduce container dump function Andrey Mirkin @ 2008-09-03 12:29 ` Matthieu Fertré 2008-09-03 14:11 ` Andrey Mirkin 2008-09-03 13:56 ` Louis Rilling 2008-09-03 14:13 ` Cedric Le Goater 3 siblings, 1 reply; 69+ messages in thread From: Matthieu Fertré @ 2008-09-03 12:29 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers Andrey Mirkin a écrit : > Add functions for context allocation/destroy. > Introduce functions to read/write image. > Introduce image header and object header. > > Signed-off-by: Andrey Mirkin <major@openvz.org> > --- > cpt/cpt.h | 37 ++++++++++++++++ > cpt/cpt_image.h | 63 +++++++++++++++++++++++++++ > cpt/sys.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++++++- > 3 files changed, 224 insertions(+), 3 deletions(-) > create mode 100644 cpt/cpt_image.h > > diff --git a/cpt/cpt.h b/cpt/cpt.h > index 381a9bf..607ac1b 100644 > --- a/cpt/cpt.h > +++ b/cpt/cpt.h > @@ -10,6 +10,8 @@ > * > */ > > +#include "cpt_image.h" > + > struct cpt_operations > { > struct module * owner; > @@ -17,3 +19,38 @@ struct cpt_operations > int (*restart)(int ctid, int fd, unsigned long flags); > }; > extern struct cpt_operations cpt_ops; > + > +#define CPT_CTX_ERROR -1 > +#define CPT_CTX_IDLE 0 > +#define CPT_CTX_DUMPING 1 > +#define CPT_CTX_UNDUMPING 2 > Maybe you can define an enum here instead. > + > +typedef struct cpt_context > +{ > + pid_t pid; /* should be changed to ctid later */ > + int ctx_id; /* context id */ > + struct list_head ctx_list; > + int refcount; > + int ctx_state; > It will be more clear that ctx_state refers to it. Thanks, Matthieu > + struct semaphore main_sem; > + > + int errno; > + > + struct file *file; > + loff_t current_object; > + > + struct list_head object_array[CPT_OBJ_MAX]; > + > + int (*write)(const void *addr, size_t count, struct cpt_context *ctx); > + int (*read)(void *addr, size_t count, struct cpt_context *ctx); > +} cpt_context_t; > + > +extern int debug_level; > + > +#define cpt_printk(lvl, fmt, args...) do { \ > + if (lvl <= debug_level) \ > + printk(fmt, ##args); \ > + } while (0) > + > +#define eprintk(a...) cpt_printk(1, "CPT ERR: " a) > +#define dprintk(a...) cpt_printk(1, "CPT DBG: " a) > diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h > new file mode 100644 > index 0000000..3d26229 > --- /dev/null > +++ b/cpt/cpt_image.h > @@ -0,0 +1,63 @@ > +/* > + * Copyright (C) 2008 Parallels, Inc. > + * > + * Author: Andrey Mirkin <major@openvz.org> > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public License as > + * published by the Free Software Foundation, version 2 of the > + * License. > + * > + */ > + > +#ifndef __CPT_IMAGE_H_ > +#define __CPT_IMAGE_H_ 1 > + > +enum _cpt_object_type > +{ > + CPT_OBJ_TASK = 0, > + CPT_OBJ_MAX, > + /* The objects above are stored in memory while checkpointing */ > + > + CPT_OBJ_HEAD = 1024, > +}; > + > +enum _cpt_content_type { > + CPT_CONTENT_VOID, > + CPT_CONTENT_ARRAY, > + CPT_CONTENT_DATA, > + CPT_CONTENT_NAME, > + CPT_CONTENT_REF, > + CPT_CONTENT_MAX > +}; > + > +#define CPT_SIGNATURE0 0x79 > +#define CPT_SIGNATURE1 0x1c > +#define CPT_SIGNATURE2 0x01 > +#define CPT_SIGNATURE3 0x63 > + > +struct cpt_head > +{ > + __u8 cpt_signature[4]; /* Magic number */ > + __u32 cpt_hdrlen; /* Header length */ > + __u16 cpt_image_major; /* Format of this file */ > + __u16 cpt_image_minor; /* Format of this file */ > + __u16 cpt_image_sublevel; /* Format of this file */ > + __u16 cpt_image_extra; /* Format of this file */ > + __u16 cpt_arch; /* Architecture */ > + __u16 cpt_pad1; > + __u32 cpt_pad2; > +#define CPT_ARCH_I386 0 > + __u64 cpt_time; /* Time */ > +} __attribute__ ((aligned (8))); > + > +/* Common object header. */ > +struct cpt_object_hdr > +{ > + __u64 cpt_len; /* Size of current chunk of data */ > + __u16 cpt_type; /* Type of object */ > + __u32 cpt_hdrlen; /* Size of header */ > + __u16 cpt_content; /* Content type: array, reference... */ > +} __attribute__ ((aligned (8))); > + > +#endif /* __CPT_IMAGE_H_ */ > diff --git a/cpt/sys.c b/cpt/sys.c > index 4051286..8334c4c 100644 > --- a/cpt/sys.c > +++ b/cpt/sys.c > @@ -13,21 +13,142 @@ > #include <linux/sched.h> > #include <linux/fs.h> > #include <linux/file.h> > -#include <linux/notifier.h> > +#include <linux/uaccess.h> > #include <linux/module.h> > > #include "cpt.h" > +#include "cpt_image.h" > > MODULE_LICENSE("GPL"); > > +/* Debug level, constant for now */ > +int debug_level = 1; > + > +static int file_write(const void *addr, size_t count, struct cpt_context *ctx) > +{ > + mm_segment_t oldfs; > + ssize_t err = -EBADF; > + struct file *file = ctx->file; > + > + oldfs = get_fs(); set_fs(KERNEL_DS); > + if (file) > + err = file->f_op->write(file, addr, count, &file->f_pos); > + set_fs(oldfs); > + if (err != count) > + return err >= 0 ? -EIO : err; > + return 0; > +} > + > +static int file_read(void *addr, size_t count, struct cpt_context *ctx) > +{ > + mm_segment_t oldfs; > + ssize_t err = -EBADF; > + struct file *file = ctx->file; > + > + oldfs = get_fs(); set_fs(KERNEL_DS); > + if (file) > + err = file->f_op->read(file, addr, count, &file->f_pos); > + set_fs(oldfs); > + if (err != count) > + return err >= 0 ? -EIO : err; > + return 0; > +} > + > +struct cpt_context * context_alloc(void) > +{ > + struct cpt_context *ctx; > + int i; > + > + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); > + if (!ctx) > + return NULL; > + > + init_MUTEX(&ctx->main_sem); > + ctx->refcount = 1; > + > + ctx->current_object = -1; > + ctx->write = file_write; > + ctx->read = file_read; > + for (i = 0; i < CPT_OBJ_MAX; i++) { > + INIT_LIST_HEAD(&ctx->object_array[i]); > + } > + > + return ctx; > +} > + > +void context_release(struct cpt_context *ctx) > +{ > + ctx->ctx_state = CPT_CTX_ERROR; > + > + if (ctx->file) > + fput(ctx->file); > + kfree(ctx); > +} > + > +static void context_put(struct cpt_context *ctx) > +{ > + if (!--ctx->refcount) > + context_release(ctx); > +} > + > static int checkpoint(pid_t pid, int fd, unsigned long flags) > { > - return -ENOSYS; > + struct file *file; > + struct cpt_context *ctx; > + int err; > + > + err = -EBADF; > + file = fget(fd); > + if (!file) > + goto out; > + > + err = -ENOMEM; > + ctx = context_alloc(); > + if (!ctx) > + goto out_file; > + > + ctx->file = file; > + ctx->ctx_state = CPT_CTX_DUMPING; > + > + /* checkpoint */ > + err = -ENOSYS; > + > + context_put(ctx); > + > +out_file: > + fput(file); > +out: > + return err; > } > > static int restart(int ctid, int fd, unsigned long flags) > { > - return -ENOSYS; > + struct file *file; > + struct cpt_context *ctx; > + int err; > + > + err = -EBADF; > + file = fget(fd); > + if (!file) > + goto out; > + > + err = -ENOMEM; > + ctx = context_alloc(); > + if (!ctx) > + goto out_file; > + > + ctx->file = file; > + ctx->ctx_state = CPT_CTX_UNDUMPING; > + > + /* restart */ > + err = -ENOSYS; > + > + context_put(ctx); > + > +out_file: > + fput(file); > +out: > + return err; > } > > static int __init init_cptrst(void) > -- Matthieu Fertré IRISA, Campus universitaire de Beaulieu - 35 042 Rennes, France tel: +33 2 99 84 22 74 fax: +33 2 99 84 71 71 mail: mfertre@irisa.fr ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 3/9] Introduce context structure needed during checkpointing/restart 2008-09-03 12:29 ` [PATCH 3/9] Introduce context structure needed during checkpointing/restart Matthieu Fertré @ 2008-09-03 14:11 ` Andrey Mirkin 0 siblings, 0 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 14:11 UTC (permalink / raw) To: Matthieu Fertré; +Cc: linux-kernel, containers On Wednesday 03 September 2008 16:29 Matthieu Fertré wrote: > Andrey Mirkin a écrit : > > Add functions for context allocation/destroy. > > Introduce functions to read/write image. > > Introduce image header and object header. > > > > Signed-off-by: Andrey Mirkin <major@openvz.org> > > --- > > cpt/cpt.h | 37 ++++++++++++++++ > > cpt/cpt_image.h | 63 +++++++++++++++++++++++++++ > > cpt/sys.c | 127 > > +++++++++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, > > 224 insertions(+), 3 deletions(-) > > create mode 100644 cpt/cpt_image.h > > > > diff --git a/cpt/cpt.h b/cpt/cpt.h > > index 381a9bf..607ac1b 100644 > > --- a/cpt/cpt.h > > +++ b/cpt/cpt.h > > @@ -10,6 +10,8 @@ > > * > > */ > > > > +#include "cpt_image.h" > > + > > struct cpt_operations > > { > > struct module * owner; > > @@ -17,3 +19,38 @@ struct cpt_operations > > int (*restart)(int ctid, int fd, unsigned long flags); > > }; > > extern struct cpt_operations cpt_ops; > > + > > +#define CPT_CTX_ERROR -1 > > +#define CPT_CTX_IDLE 0 > > +#define CPT_CTX_DUMPING 1 > > +#define CPT_CTX_UNDUMPING 2 > > Maybe you can define an enum here instead. > > > + > > +typedef struct cpt_context > > +{ > > + pid_t pid; /* should be changed to ctid later */ > > + int ctx_id; /* context id */ > > + struct list_head ctx_list; > > + int refcount; > > + int ctx_state; > > It will be more clear that ctx_state refers to it. Thanks for the idea with enum, I'll fix it in next version. Regards, Andrey > > > + struct semaphore main_sem; > > + > > + int errno; > > + > > + struct file *file; > > + loff_t current_object; > > + > > + struct list_head object_array[CPT_OBJ_MAX]; > > + > > + int (*write)(const void *addr, size_t count, struct cpt_context *ctx); > > + int (*read)(void *addr, size_t count, struct cpt_context *ctx); > > +} cpt_context_t; > > + > > +extern int debug_level; > > + > > +#define cpt_printk(lvl, fmt, args...) do { \ > > + if (lvl <= debug_level) \ > > + printk(fmt, ##args); \ > > + } while (0) > > + > > +#define eprintk(a...) cpt_printk(1, "CPT ERR: " a) > > +#define dprintk(a...) cpt_printk(1, "CPT DBG: " a) > > diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h > > new file mode 100644 > > index 0000000..3d26229 > > --- /dev/null > > +++ b/cpt/cpt_image.h > > @@ -0,0 +1,63 @@ > > +/* > > + * Copyright (C) 2008 Parallels, Inc. > > + * > > + * Author: Andrey Mirkin <major@openvz.org> > > + * > > + * This program is free software; you can redistribute it and/or > > + * modify it under the terms of the GNU General Public License as > > + * published by the Free Software Foundation, version 2 of the > > + * License. > > + * > > + */ > > + > > +#ifndef __CPT_IMAGE_H_ > > +#define __CPT_IMAGE_H_ 1 > > + > > +enum _cpt_object_type > > +{ > > + CPT_OBJ_TASK = 0, > > + CPT_OBJ_MAX, > > + /* The objects above are stored in memory while checkpointing */ > > + > > + CPT_OBJ_HEAD = 1024, > > +}; > > + > > +enum _cpt_content_type { > > + CPT_CONTENT_VOID, > > + CPT_CONTENT_ARRAY, > > + CPT_CONTENT_DATA, > > + CPT_CONTENT_NAME, > > + CPT_CONTENT_REF, > > + CPT_CONTENT_MAX > > +}; > > + > > +#define CPT_SIGNATURE0 0x79 > > +#define CPT_SIGNATURE1 0x1c > > +#define CPT_SIGNATURE2 0x01 > > +#define CPT_SIGNATURE3 0x63 > > + > > +struct cpt_head > > +{ > > + __u8 cpt_signature[4]; /* Magic number */ > > + __u32 cpt_hdrlen; /* Header length */ > > + __u16 cpt_image_major; /* Format of this file */ > > + __u16 cpt_image_minor; /* Format of this file */ > > + __u16 cpt_image_sublevel; /* Format of this file */ > > + __u16 cpt_image_extra; /* Format of this file */ > > + __u16 cpt_arch; /* Architecture */ > > + __u16 cpt_pad1; > > + __u32 cpt_pad2; > > +#define CPT_ARCH_I386 0 > > + __u64 cpt_time; /* Time */ > > +} __attribute__ ((aligned (8))); > > + > > +/* Common object header. */ > > +struct cpt_object_hdr > > +{ > > + __u64 cpt_len; /* Size of current chunk of data */ > > + __u16 cpt_type; /* Type of object */ > > + __u32 cpt_hdrlen; /* Size of header */ > > + __u16 cpt_content; /* Content type: array, reference... */ > > +} __attribute__ ((aligned (8))); > > + > > +#endif /* __CPT_IMAGE_H_ */ > > diff --git a/cpt/sys.c b/cpt/sys.c > > index 4051286..8334c4c 100644 > > --- a/cpt/sys.c > > +++ b/cpt/sys.c > > @@ -13,21 +13,142 @@ > > #include <linux/sched.h> > > #include <linux/fs.h> > > #include <linux/file.h> > > -#include <linux/notifier.h> > > +#include <linux/uaccess.h> > > #include <linux/module.h> > > > > #include "cpt.h" > > +#include "cpt_image.h" > > > > MODULE_LICENSE("GPL"); > > > > +/* Debug level, constant for now */ > > +int debug_level = 1; > > + > > +static int file_write(const void *addr, size_t count, struct cpt_context > > *ctx) +{ > > + mm_segment_t oldfs; > > + ssize_t err = -EBADF; > > + struct file *file = ctx->file; > > + > > + oldfs = get_fs(); set_fs(KERNEL_DS); > > + if (file) > > + err = file->f_op->write(file, addr, count, &file->f_pos); > > + set_fs(oldfs); > > + if (err != count) > > + return err >= 0 ? -EIO : err; > > + return 0; > > +} > > + > > +static int file_read(void *addr, size_t count, struct cpt_context *ctx) > > +{ > > + mm_segment_t oldfs; > > + ssize_t err = -EBADF; > > + struct file *file = ctx->file; > > + > > + oldfs = get_fs(); set_fs(KERNEL_DS); > > + if (file) > > + err = file->f_op->read(file, addr, count, &file->f_pos); > > + set_fs(oldfs); > > + if (err != count) > > + return err >= 0 ? -EIO : err; > > + return 0; > > +} > > + > > +struct cpt_context * context_alloc(void) > > +{ > > + struct cpt_context *ctx; > > + int i; > > + > > + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); > > + if (!ctx) > > + return NULL; > > + > > + init_MUTEX(&ctx->main_sem); > > + ctx->refcount = 1; > > + > > + ctx->current_object = -1; > > + ctx->write = file_write; > > + ctx->read = file_read; > > + for (i = 0; i < CPT_OBJ_MAX; i++) { > > + INIT_LIST_HEAD(&ctx->object_array[i]); > > + } > > + > > + return ctx; > > +} > > + > > +void context_release(struct cpt_context *ctx) > > +{ > > + ctx->ctx_state = CPT_CTX_ERROR; > > + > > + if (ctx->file) > > + fput(ctx->file); > > + kfree(ctx); > > +} > > + > > +static void context_put(struct cpt_context *ctx) > > +{ > > + if (!--ctx->refcount) > > + context_release(ctx); > > +} > > + > > static int checkpoint(pid_t pid, int fd, unsigned long flags) > > { > > - return -ENOSYS; > > + struct file *file; > > + struct cpt_context *ctx; > > + int err; > > + > > + err = -EBADF; > > + file = fget(fd); > > + if (!file) > > + goto out; > > + > > + err = -ENOMEM; > > + ctx = context_alloc(); > > + if (!ctx) > > + goto out_file; > > + > > + ctx->file = file; > > + ctx->ctx_state = CPT_CTX_DUMPING; > > + > > + /* checkpoint */ > > + err = -ENOSYS; > > + > > + context_put(ctx); > > + > > +out_file: > > + fput(file); > > +out: > > + return err; > > } > > > > static int restart(int ctid, int fd, unsigned long flags) > > { > > - return -ENOSYS; > > + struct file *file; > > + struct cpt_context *ctx; > > + int err; > > + > > + err = -EBADF; > > + file = fget(fd); > > + if (!file) > > + goto out; > > + > > + err = -ENOMEM; > > + ctx = context_alloc(); > > + if (!ctx) > > + goto out_file; > > + > > + ctx->file = file; > > + ctx->ctx_state = CPT_CTX_UNDUMPING; > > + > > + /* restart */ > > + err = -ENOSYS; > > + > > + context_put(ctx); > > + > > +out_file: > > + fput(file); > > +out: > > + return err; > > } > > > > static int __init init_cptrst(void) ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 3/9] Introduce context structure needed during checkpointing/restart 2008-09-03 10:57 ` [PATCH 3/9] Introduce context structure needed during checkpointing/restart Andrey Mirkin 2008-09-03 10:57 ` [PATCH 4/9] Introduce container dump function Andrey Mirkin 2008-09-03 12:29 ` [PATCH 3/9] Introduce context structure needed during checkpointing/restart Matthieu Fertré @ 2008-09-03 13:56 ` Louis Rilling 2008-09-03 14:07 ` Andrey Mirkin 2008-09-03 14:13 ` Cedric Le Goater 3 siblings, 1 reply; 69+ messages in thread From: Louis Rilling @ 2008-09-03 13:56 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers [-- Attachment #1: Type: text/plain, Size: 2452 bytes --] On Wed, Sep 03, 2008 at 02:57:50PM +0400, Andrey Mirkin wrote: > Add functions for context allocation/destroy. > Introduce functions to read/write image. > Introduce image header and object header. > [...] > diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h > new file mode 100644 > index 0000000..3d26229 > --- /dev/null > +++ b/cpt/cpt_image.h > @@ -0,0 +1,63 @@ > +/* > + * Copyright (C) 2008 Parallels, Inc. > + * > + * Author: Andrey Mirkin <major@openvz.org> > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public License as > + * published by the Free Software Foundation, version 2 of the > + * License. > + * > + */ > + > +#ifndef __CPT_IMAGE_H_ > +#define __CPT_IMAGE_H_ 1 > + > +enum _cpt_object_type > +{ > + CPT_OBJ_TASK = 0, > + CPT_OBJ_MAX, > + /* The objects above are stored in memory while checkpointing */ > + > + CPT_OBJ_HEAD = 1024, > +}; > + > +enum _cpt_content_type { > + CPT_CONTENT_VOID, > + CPT_CONTENT_ARRAY, > + CPT_CONTENT_DATA, > + CPT_CONTENT_NAME, > + CPT_CONTENT_REF, > + CPT_CONTENT_MAX > +}; > + > +#define CPT_SIGNATURE0 0x79 > +#define CPT_SIGNATURE1 0x1c > +#define CPT_SIGNATURE2 0x01 > +#define CPT_SIGNATURE3 0x63 > + > +struct cpt_head > +{ > + __u8 cpt_signature[4]; /* Magic number */ > + __u32 cpt_hdrlen; /* Header length */ > + __u16 cpt_image_major; /* Format of this file */ > + __u16 cpt_image_minor; /* Format of this file */ > + __u16 cpt_image_sublevel; /* Format of this file */ > + __u16 cpt_image_extra; /* Format of this file */ > + __u16 cpt_arch; /* Architecture */ > + __u16 cpt_pad1; > + __u32 cpt_pad2; > +#define CPT_ARCH_I386 0 Why is this constant precisely defined after the padding? > + __u64 cpt_time; /* Time */ > +} __attribute__ ((aligned (8))); > + > +/* Common object header. */ > +struct cpt_object_hdr > +{ > + __u64 cpt_len; /* Size of current chunk of data */ > + __u16 cpt_type; /* Type of object */ > + __u32 cpt_hdrlen; /* Size of header */ > + __u16 cpt_content; /* Content type: array, reference... */ This layout looks a bit awkward for 32bits/64bits compatibility. Maybe put cpt_hdrlen before cpt_type? Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 3/9] Introduce context structure needed during checkpointing/restart 2008-09-03 13:56 ` Louis Rilling @ 2008-09-03 14:07 ` Andrey Mirkin 0 siblings, 0 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 14:07 UTC (permalink / raw) To: Louis.Rilling; +Cc: linux-kernel, containers On Wednesday 03 September 2008 17:56 Louis Rilling wrote: > On Wed, Sep 03, 2008 at 02:57:50PM +0400, Andrey Mirkin wrote: > > Add functions for context allocation/destroy. > > Introduce functions to read/write image. > > Introduce image header and object header. > > [...] > > > diff --git a/cpt/cpt_image.h b/cpt/cpt_image.h > > new file mode 100644 > > index 0000000..3d26229 > > --- /dev/null > > +++ b/cpt/cpt_image.h > > @@ -0,0 +1,63 @@ > > +/* > > + * Copyright (C) 2008 Parallels, Inc. > > + * > > + * Author: Andrey Mirkin <major@openvz.org> > > + * > > + * This program is free software; you can redistribute it and/or > > + * modify it under the terms of the GNU General Public License as > > + * published by the Free Software Foundation, version 2 of the > > + * License. > > + * > > + */ > > + > > +#ifndef __CPT_IMAGE_H_ > > +#define __CPT_IMAGE_H_ 1 > > + > > +enum _cpt_object_type > > +{ > > + CPT_OBJ_TASK = 0, > > + CPT_OBJ_MAX, > > + /* The objects above are stored in memory while checkpointing */ > > + > > + CPT_OBJ_HEAD = 1024, > > +}; > > + > > +enum _cpt_content_type { > > + CPT_CONTENT_VOID, > > + CPT_CONTENT_ARRAY, > > + CPT_CONTENT_DATA, > > + CPT_CONTENT_NAME, > > + CPT_CONTENT_REF, > > + CPT_CONTENT_MAX > > +}; > > + > > +#define CPT_SIGNATURE0 0x79 > > +#define CPT_SIGNATURE1 0x1c > > +#define CPT_SIGNATURE2 0x01 > > +#define CPT_SIGNATURE3 0x63 > > + > > +struct cpt_head > > +{ > > + __u8 cpt_signature[4]; /* Magic number */ > > + __u32 cpt_hdrlen; /* Header length */ > > + __u16 cpt_image_major; /* Format of this file */ > > + __u16 cpt_image_minor; /* Format of this file */ > > + __u16 cpt_image_sublevel; /* Format of this file */ > > + __u16 cpt_image_extra; /* Format of this file */ > > + __u16 cpt_arch; /* Architecture */ > > + __u16 cpt_pad1; > > + __u32 cpt_pad2; > > +#define CPT_ARCH_I386 0 > > Why is this constant precisely defined after the padding? I will move it right after cpt_arch. Thanks. > > > + __u64 cpt_time; /* Time */ > > +} __attribute__ ((aligned (8))); > > + > > +/* Common object header. */ > > +struct cpt_object_hdr > > +{ > > + __u64 cpt_len; /* Size of current chunk of data */ > > + __u16 cpt_type; /* Type of object */ > > + __u32 cpt_hdrlen; /* Size of header */ > > + __u16 cpt_content; /* Content type: array, reference... */ > > This layout looks a bit awkward for 32bits/64bits compatibility. Maybe put > cpt_hdrlen before cpt_type? You right here. Thanks for the catch. I will change that in next version. Thanks, Andrey ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 3/9] Introduce context structure needed during checkpointing/restart 2008-09-03 10:57 ` [PATCH 3/9] Introduce context structure needed during checkpointing/restart Andrey Mirkin ` (2 preceding siblings ...) 2008-09-03 13:56 ` Louis Rilling @ 2008-09-03 14:13 ` Cedric Le Goater 2008-09-03 14:29 ` Andrey Mirkin 3 siblings, 1 reply; 69+ messages in thread From: Cedric Le Goater @ 2008-09-03 14:13 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers > +void context_release(struct cpt_context *ctx) > +{ > + ctx->ctx_state = CPT_CTX_ERROR; > + > + if (ctx->file) > + fput(ctx->file); > + kfree(ctx); > +} > + > +static void context_put(struct cpt_context *ctx) > +{ > + if (!--ctx->refcount) > + context_release(ctx); > +} > + > static int checkpoint(pid_t pid, int fd, unsigned long flags) > { > - return -ENOSYS; > + struct file *file; > + struct cpt_context *ctx; > + int err; > + > + err = -EBADF; > + file = fget(fd); > + if (!file) > + goto out; > + > + err = -ENOMEM; > + ctx = context_alloc(); > + if (!ctx) > + goto out_file; > + > + ctx->file = file; > + ctx->ctx_state = CPT_CTX_DUMPING; > + > + /* checkpoint */ > + err = -ENOSYS; > + > + context_put(ctx); > + > +out_file: > + fput(file); > +out: > + return err; > } it looks like fput(file) is done twice in checkpoint() and context_release() ? C. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 3/9] Introduce context structure needed during checkpointing/restart 2008-09-03 14:13 ` Cedric Le Goater @ 2008-09-03 14:29 ` Andrey Mirkin 0 siblings, 0 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 14:29 UTC (permalink / raw) To: Cedric Le Goater; +Cc: linux-kernel, containers On Wednesday 03 September 2008 18:13 Cedric Le Goater wrote: > > +void context_release(struct cpt_context *ctx) > > +{ > > + ctx->ctx_state = CPT_CTX_ERROR; > > + > > + if (ctx->file) > > + fput(ctx->file); > > + kfree(ctx); > > +} > > + > > +static void context_put(struct cpt_context *ctx) > > +{ > > + if (!--ctx->refcount) > > + context_release(ctx); > > +} > > + > > static int checkpoint(pid_t pid, int fd, unsigned long flags) > > { > > - return -ENOSYS; > > + struct file *file; > > + struct cpt_context *ctx; > > + int err; > > + > > + err = -EBADF; > > + file = fget(fd); > > + if (!file) > > + goto out; > > + > > + err = -ENOMEM; > > + ctx = context_alloc(); > > + if (!ctx) > > + goto out_file; > > + > > + ctx->file = file; > > + ctx->ctx_state = CPT_CTX_DUMPING; > > + > > + /* checkpoint */ > > + err = -ENOSYS; > > + > > + context_put(ctx); > > + > > +out_file: > > + fput(file); > > +out: > > + return err; > > } > > it looks like fput(file) is done twice in checkpoint() and > context_release() ? Oh, you are right. I'm sorry, I was in rush and sent an outdated version. Will resend correct patch shortly. Thanks, Andrey ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 2/9] Make checkpoint/restart functionality modular 2008-09-03 10:57 ` [PATCH 2/9] Make checkpoint/restart functionality modular Andrey Mirkin 2008-09-03 10:57 ` [PATCH 3/9] Introduce context structure needed during checkpointing/restart Andrey Mirkin @ 2008-09-03 14:27 ` Serge E. Hallyn 2008-09-03 14:51 ` Andrey Mirkin 1 sibling, 1 reply; 69+ messages in thread From: Serge E. Hallyn @ 2008-09-03 14:27 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers Quoting Andrey Mirkin (major@openvz.org): > A config option CONFIG_CHECKPOINT is introduced. > New structure cpt_operations is introduced to store pointers to > checkpoint/restart functions from module. Thanks Andrey. The structure of this patchset is very nice, and I'm finding it very easy to read, at least the first half. The differences (in yours and Oren's) between checkpointing and restarting yourself or another task should probably not be important as people will want to be able to do both in the end, and I'm sure both patchsets can do both. And you have things like your own printk helpers in patch 3 that, just as in Oren's set, will have to go away. But I'm not yet able to see what the real differences are. Keeping on looking... > Signed-off-by: Andrey Mirkin <major@openvz.org> > --- > cpt/Kconfig | 7 +++++++ > cpt/Makefile | 4 ++++ > cpt/cpt.h | 19 +++++++++++++++++++ > cpt/sys.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ > cpt/sys_core.c | 29 +++++++++++++++++++++++++++-- > init/Kconfig | 2 ++ > 6 files changed, 107 insertions(+), 2 deletions(-) > create mode 100644 cpt/Kconfig > create mode 100644 cpt/cpt.h > create mode 100644 cpt/sys.c > > diff --git a/cpt/Kconfig b/cpt/Kconfig > new file mode 100644 > index 0000000..b9bc72d > --- /dev/null > +++ b/cpt/Kconfig > @@ -0,0 +1,7 @@ > +config CHECKPOINT > + tristate "Checkpoint & restart for containers" > + depends on EXPERIMENTAL > + default n > + help > + This option adds module "cptrst", which allow to save a running > + container to a file and restart it later using this image file. > diff --git a/cpt/Makefile b/cpt/Makefile > index 2276fb1..bfe75d5 100644 > --- a/cpt/Makefile > +++ b/cpt/Makefile > @@ -1 +1,5 @@ > obj-y += sys_core.o > + > +obj-$(CONFIG_CHECKPOINT) += cptrst.o > + > +cptrst-objs := sys.o > diff --git a/cpt/cpt.h b/cpt/cpt.h > new file mode 100644 > index 0000000..381a9bf > --- /dev/null > +++ b/cpt/cpt.h > @@ -0,0 +1,19 @@ > +/* > + * Copyright (C) 2008 Parallels, Inc. > + * > + * Author: Andrey Mirkin <major@openvz.org> > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public License as > + * published by the Free Software Foundation, version 2 of the > + * License. > + * > + */ > + > +struct cpt_operations > +{ > + struct module * owner; > + int (*checkpoint)(pid_t pid, int fd, unsigned long flags); > + int (*restart)(int ctid, int fd, unsigned long flags); > +}; > +extern struct cpt_operations cpt_ops; > diff --git a/cpt/sys.c b/cpt/sys.c > new file mode 100644 > index 0000000..4051286 > --- /dev/null > +++ b/cpt/sys.c > @@ -0,0 +1,48 @@ > +/* > + * Copyright (C) 2008 Parallels, Inc. > + * > + * Author: Andrey Mirkin <major@openvz.org> > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public License as > + * published by the Free Software Foundation, version 2 of the > + * License. > + * > + */ > + > +#include <linux/sched.h> > +#include <linux/fs.h> > +#include <linux/file.h> > +#include <linux/notifier.h> > +#include <linux/module.h> > + > +#include "cpt.h" > + > +MODULE_LICENSE("GPL"); > + > +static int checkpoint(pid_t pid, int fd, unsigned long flags) > +{ > + return -ENOSYS; > +} > + > +static int restart(int ctid, int fd, unsigned long flags) > +{ > + return -ENOSYS; > +} > + > +static int __init init_cptrst(void) > +{ > + cpt_ops.owner = THIS_MODULE; > + cpt_ops.checkpoint = checkpoint; > + cpt_ops.restart = restart; > + return 0; > +} > +module_init(init_cptrst); > + > +static void __exit exit_cptrst(void) > +{ > + cpt_ops.checkpoint = NULL; > + cpt_ops.restart = NULL; > + cpt_ops.owner = NULL; > +} > +module_exit(exit_cptrst); > diff --git a/cpt/sys_core.c b/cpt/sys_core.c > index 1a97fb6..5dd1191 100644 > --- a/cpt/sys_core.c > +++ b/cpt/sys_core.c > @@ -13,6 +13,13 @@ > #include <linux/sched.h> > #include <linux/fs.h> > #include <linux/file.h> > +#include <linux/notifier.h> > +#include <linux/module.h> > + > +#include "cpt.h" > + > +struct cpt_operations cpt_ops = { NULL, NULL, NULL }; > +EXPORT_SYMBOL(cpt_ops); > > /** > * sys_checkpoint - checkpoint a container from outside > @@ -23,7 +30,16 @@ > */ > asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags) > { > - return -ENOSYS; > + int ret; > + > + ret = -ENOSYS; > + > + if (try_module_get(cpt_ops.owner)) { > + if (cpt_ops.checkpoint) > + ret = cpt_ops.checkpoint(pid, fd, flags); > + module_put(cpt_ops.owner); > + } > + return ret; > } > > /** > @@ -34,5 +50,14 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags) > */ > asmlinkage long sys_restart(int ctid, int fd, unsigned long flags) > { > - return -ENOSYS; > + int ret; > + > + ret = -ENOSYS; > + > + if (try_module_get(cpt_ops.owner)) { > + if (cpt_ops.restart) > + ret = cpt_ops.restart(ctid, fd, flags); > + module_put(cpt_ops.owner); > + } > + return ret; > } > diff --git a/init/Kconfig b/init/Kconfig > index 4bd4b0c..d29ed21 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -344,6 +344,8 @@ config CGROUP_FREEZER > Provides a way to freeze and unfreeze all tasks in a > cgroup > > +source "cpt/Kconfig" > + > config FAIR_GROUP_SCHED > bool "Group scheduling for SCHED_OTHER" > depends on GROUP_SCHED > -- > 1.5.6 > > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 2/9] Make checkpoint/restart functionality modular 2008-09-03 14:27 ` [PATCH 2/9] Make checkpoint/restart functionality modular Serge E. Hallyn @ 2008-09-03 14:51 ` Andrey Mirkin 0 siblings, 0 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 14:51 UTC (permalink / raw) To: Serge E. Hallyn; +Cc: linux-kernel, containers On Wednesday 03 September 2008 18:27 Serge E. Hallyn wrote: > Quoting Andrey Mirkin (major@openvz.org): > > A config option CONFIG_CHECKPOINT is introduced. > > New structure cpt_operations is introduced to store pointers to > > checkpoint/restart functions from module. > > Thanks Andrey. The structure of this patchset is very nice, and I'm > finding it very easy to read, at least the first half. > > The differences (in yours and Oren's) between checkpointing and > restarting yourself or another task should probably not be important > as people will want to be able to do both in the end, and I'm sure > both patchsets can do both. And you have things like your own > printk helpers in patch 3 that, just as in Oren's set, will have to > go away. But I'm not yet able to see what the real differences are. > > Keeping on looking... Thanks for reviewing. My owen printk is just for debug purpose, I will change them to something already existing. I'm waiting for more comments from you. Thanks, Andrey > > > Signed-off-by: Andrey Mirkin <major@openvz.org> > > --- > > cpt/Kconfig | 7 +++++++ > > cpt/Makefile | 4 ++++ > > cpt/cpt.h | 19 +++++++++++++++++++ > > cpt/sys.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++ > > cpt/sys_core.c | 29 +++++++++++++++++++++++++++-- > > init/Kconfig | 2 ++ > > 6 files changed, 107 insertions(+), 2 deletions(-) > > create mode 100644 cpt/Kconfig > > create mode 100644 cpt/cpt.h > > create mode 100644 cpt/sys.c > > > > diff --git a/cpt/Kconfig b/cpt/Kconfig > > new file mode 100644 > > index 0000000..b9bc72d > > --- /dev/null > > +++ b/cpt/Kconfig > > @@ -0,0 +1,7 @@ > > +config CHECKPOINT > > + tristate "Checkpoint & restart for containers" > > + depends on EXPERIMENTAL > > + default n > > + help > > + This option adds module "cptrst", which allow to save a running > > + container to a file and restart it later using this image file. > > diff --git a/cpt/Makefile b/cpt/Makefile > > index 2276fb1..bfe75d5 100644 > > --- a/cpt/Makefile > > +++ b/cpt/Makefile > > @@ -1 +1,5 @@ > > obj-y += sys_core.o > > + > > +obj-$(CONFIG_CHECKPOINT) += cptrst.o > > + > > +cptrst-objs := sys.o > > diff --git a/cpt/cpt.h b/cpt/cpt.h > > new file mode 100644 > > index 0000000..381a9bf > > --- /dev/null > > +++ b/cpt/cpt.h > > @@ -0,0 +1,19 @@ > > +/* > > + * Copyright (C) 2008 Parallels, Inc. > > + * > > + * Author: Andrey Mirkin <major@openvz.org> > > + * > > + * This program is free software; you can redistribute it and/or > > + * modify it under the terms of the GNU General Public License as > > + * published by the Free Software Foundation, version 2 of the > > + * License. > > + * > > + */ > > + > > +struct cpt_operations > > +{ > > + struct module * owner; > > + int (*checkpoint)(pid_t pid, int fd, unsigned long flags); > > + int (*restart)(int ctid, int fd, unsigned long flags); > > +}; > > +extern struct cpt_operations cpt_ops; > > diff --git a/cpt/sys.c b/cpt/sys.c > > new file mode 100644 > > index 0000000..4051286 > > --- /dev/null > > +++ b/cpt/sys.c > > @@ -0,0 +1,48 @@ > > +/* > > + * Copyright (C) 2008 Parallels, Inc. > > + * > > + * Author: Andrey Mirkin <major@openvz.org> > > + * > > + * This program is free software; you can redistribute it and/or > > + * modify it under the terms of the GNU General Public License as > > + * published by the Free Software Foundation, version 2 of the > > + * License. > > + * > > + */ > > + > > +#include <linux/sched.h> > > +#include <linux/fs.h> > > +#include <linux/file.h> > > +#include <linux/notifier.h> > > +#include <linux/module.h> > > + > > +#include "cpt.h" > > + > > +MODULE_LICENSE("GPL"); > > + > > +static int checkpoint(pid_t pid, int fd, unsigned long flags) > > +{ > > + return -ENOSYS; > > +} > > + > > +static int restart(int ctid, int fd, unsigned long flags) > > +{ > > + return -ENOSYS; > > +} > > + > > +static int __init init_cptrst(void) > > +{ > > + cpt_ops.owner = THIS_MODULE; > > + cpt_ops.checkpoint = checkpoint; > > + cpt_ops.restart = restart; > > + return 0; > > +} > > +module_init(init_cptrst); > > + > > +static void __exit exit_cptrst(void) > > +{ > > + cpt_ops.checkpoint = NULL; > > + cpt_ops.restart = NULL; > > + cpt_ops.owner = NULL; > > +} > > +module_exit(exit_cptrst); > > diff --git a/cpt/sys_core.c b/cpt/sys_core.c > > index 1a97fb6..5dd1191 100644 > > --- a/cpt/sys_core.c > > +++ b/cpt/sys_core.c > > @@ -13,6 +13,13 @@ > > #include <linux/sched.h> > > #include <linux/fs.h> > > #include <linux/file.h> > > +#include <linux/notifier.h> > > +#include <linux/module.h> > > + > > +#include "cpt.h" > > + > > +struct cpt_operations cpt_ops = { NULL, NULL, NULL }; > > +EXPORT_SYMBOL(cpt_ops); > > > > /** > > * sys_checkpoint - checkpoint a container from outside > > @@ -23,7 +30,16 @@ > > */ > > asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags) > > { > > - return -ENOSYS; > > + int ret; > > + > > + ret = -ENOSYS; > > + > > + if (try_module_get(cpt_ops.owner)) { > > + if (cpt_ops.checkpoint) > > + ret = cpt_ops.checkpoint(pid, fd, flags); > > + module_put(cpt_ops.owner); > > + } > > + return ret; > > } > > > > /** > > @@ -34,5 +50,14 @@ asmlinkage long sys_checkpoint(pid_t pid, int fd, > > unsigned long flags) */ > > asmlinkage long sys_restart(int ctid, int fd, unsigned long flags) > > { > > - return -ENOSYS; > > + int ret; > > + > > + ret = -ENOSYS; > > + > > + if (try_module_get(cpt_ops.owner)) { > > + if (cpt_ops.restart) > > + ret = cpt_ops.restart(ctid, fd, flags); > > + module_put(cpt_ops.owner); > > + } > > + return ret; > > } > > diff --git a/init/Kconfig b/init/Kconfig > > index 4bd4b0c..d29ed21 100644 > > --- a/init/Kconfig > > +++ b/init/Kconfig > > @@ -344,6 +344,8 @@ config CGROUP_FREEZER > > Provides a way to freeze and unfreeze all tasks in a > > cgroup > > > > +source "cpt/Kconfig" > > + > > config FAIR_GROUP_SCHED > > bool "Group scheduling for SCHED_OTHER" > > depends on GROUP_SCHED > > -- > > 1.5.6 > > > > _______________________________________________ > > Containers mailing list > > Containers@lists.linux-foundation.org > > https://lists.linux-foundation.org/mailman/listinfo/containers > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls 2008-09-03 10:57 ` [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls Andrey Mirkin 2008-09-03 10:57 ` [PATCH 2/9] Make checkpoint/restart functionality modular Andrey Mirkin @ 2008-09-03 11:44 ` Cedric Le Goater 2008-09-03 13:05 ` [Devel] " Andrey Mirkin 1 sibling, 1 reply; 69+ messages in thread From: Cedric Le Goater @ 2008-09-03 11:44 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers Andrey Mirkin wrote: > Right now they just return -ENOSYS. Later they will provide functionality > to checkpoint and restart a container. > > Both syscalls take as arguments a file descriptor and flags. > Also sys_checkpoint take as the first argument a PID of container's init > (later it will be container ID); sys_restart takes as the first argument > a container ID (right now it will not be used). > > Signed-off-by: Andrey Mirkin <major@openvz.org> > --- > Makefile | 2 +- > arch/x86/kernel/syscall_table_32.S | 2 + > cpt/Makefile | 1 + > cpt/sys_core.c | 38 ++++++++++++++++++++++++++++++++++++ > include/asm-x86/unistd_32.h | 2 + > 5 files changed, 44 insertions(+), 1 deletions(-) > create mode 100644 cpt/Makefile > create mode 100644 cpt/sys_core.c > > diff --git a/Makefile b/Makefile > index ea413fa..1dee5c0 100644 > --- a/Makefile > +++ b/Makefile > @@ -619,7 +619,7 @@ export mod_strip_cmd > > > ifeq ($(KBUILD_EXTMOD),) > -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ > +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ cpt/ > > vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \ > $(core-y) $(core-m) $(drivers-y) $(drivers-m) \ > diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S > index fd9d4f4..4a0d7fb 100644 > --- a/arch/x86/kernel/syscall_table_32.S > +++ b/arch/x86/kernel/syscall_table_32.S > @@ -333,3 +333,5 @@ ENTRY(sys_call_table) > .long sys_pipe2 > .long sys_inotify_init1 > .long sys_hijack > + .long sys_checkpoint > + .long sys_restart /* 335 */ this patchset is based on top of : git://git.kernel.org/pub/scm/linux/kernel/git/daveh/linux-2.6-lxc.git right ? C. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls 2008-09-03 11:44 ` [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls Cedric Le Goater @ 2008-09-03 13:05 ` Andrey Mirkin 0 siblings, 0 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 13:05 UTC (permalink / raw) To: devel; +Cc: Cedric Le Goater, Andrey Mirkin, containers, linux-kernel On Wednesday 03 September 2008 15:44 Cedric Le Goater wrote: > Andrey Mirkin wrote: > > Right now they just return -ENOSYS. Later they will provide functionality > > to checkpoint and restart a container. > > > > Both syscalls take as arguments a file descriptor and flags. > > Also sys_checkpoint take as the first argument a PID of container's init > > (later it will be container ID); sys_restart takes as the first argument > > a container ID (right now it will not be used). > > > > Signed-off-by: Andrey Mirkin <major@openvz.org> > > --- > > Makefile | 2 +- > > arch/x86/kernel/syscall_table_32.S | 2 + > > cpt/Makefile | 1 + > > cpt/sys_core.c | 38 > > ++++++++++++++++++++++++++++++++++++ include/asm-x86/unistd_32.h | > > 2 + > > 5 files changed, 44 insertions(+), 1 deletions(-) > > create mode 100644 cpt/Makefile > > create mode 100644 cpt/sys_core.c > > > > diff --git a/Makefile b/Makefile > > index ea413fa..1dee5c0 100644 > > --- a/Makefile > > +++ b/Makefile > > @@ -619,7 +619,7 @@ export mod_strip_cmd > > > > > > ifeq ($(KBUILD_EXTMOD),) > > -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ > > +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ cpt/ > > > > vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \ > > $(core-y) $(core-m) $(drivers-y) $(drivers-m) \ > > diff --git a/arch/x86/kernel/syscall_table_32.S > > b/arch/x86/kernel/syscall_table_32.S index fd9d4f4..4a0d7fb 100644 > > --- a/arch/x86/kernel/syscall_table_32.S > > +++ b/arch/x86/kernel/syscall_table_32.S > > @@ -333,3 +333,5 @@ ENTRY(sys_call_table) > > .long sys_pipe2 > > .long sys_inotify_init1 > > .long sys_hijack > > + .long sys_checkpoint > > + .long sys_restart /* 335 */ > > this patchset is based on top of : > > git://git.kernel.org/pub/scm/linux/kernel/git/daveh/linux-2.6-lxc.git > > right ? Yes, it is based on Dave's git tree. Regards, Andrey ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-09-03 10:57 [PATCH 0/9] OpenVZ kernel based checkpointing/restart Andrey Mirkin 2008-09-03 10:57 ` [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls Andrey Mirkin @ 2008-09-03 12:28 ` Cedric Le Goater 2008-09-03 13:59 ` [Devel] " Andrey Mirkin 2008-09-03 14:18 ` Serge E. Hallyn 2008-09-03 13:49 ` Louis Rilling ` (3 subsequent siblings) 5 siblings, 2 replies; 69+ messages in thread From: Cedric Le Goater @ 2008-09-03 12:28 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers Andrey Mirkin wrote: > This patchset introduces kernel based checkpointing/restart as it is > implemented in OpenVZ project. This patchset has limited functionality and > are able to checkpoint/restart only single process. Recently Oren Laaden > sent another kernel based implementation of checkpoint/restart. The main > differences between this patchset and Oren's patchset are: > > * In this patchset checkpointing initiated not from the process > (right now we do not have a container, only namespaces), Oren's patchset > performs checkpointing from the process context. > > * Restart in this patchset is initiated from process, which restarts a new > process (in new namespaces) with saved state. Oren's patchset uses the same > process from which restart was initiated and restore saved state over it. > > * Checkpoint/restart functionality in this patchset is implemented as a kernel > module why ? Do we really think that C/R implementations will be so different that we will need C/R ops to support all of them ? I imagine that there could be different models : 1. brute force : dump it all and kill 2. incremental 3. live migration ... But I see all of them really tied to the kernel internals. The first issues I see with this direction are some EXPORT_SYMBOL() that would be useless without a module. > As checkpointing is initiated not from the process which state should be saved > we should freeze a process before saving its state. Right now Container Freezer > from Matt Helsley can be used for this. OK that's integrated and Daniel's tools : http://lxc.cvs.sourceforge.net/lxc/ one more reason to work on integration :) C. > This patchset introduce only a concept how kernel based checkpointing/restart > can be implemented and are able to checkpoint/restart only a single process > with simple VMAs. > > I've tried to split my patchset in small patches to make review more easier. > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/containers > ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-09-03 12:28 ` [PATCH 0/9] OpenVZ kernel based checkpointing/restart Cedric Le Goater @ 2008-09-03 13:59 ` Andrey Mirkin 2008-09-04 22:55 ` Dave Hansen 2008-09-03 14:18 ` Serge E. Hallyn 1 sibling, 1 reply; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 13:59 UTC (permalink / raw) To: devel; +Cc: Cedric Le Goater, Andrey Mirkin, containers, linux-kernel On Wednesday 03 September 2008 16:28 Cedric Le Goater wrote: > Andrey Mirkin wrote: > > This patchset introduces kernel based checkpointing/restart as it is > > implemented in OpenVZ project. This patchset has limited functionality > > and are able to checkpoint/restart only single process. Recently Oren > > Laaden sent another kernel based implementation of checkpoint/restart. > > The main differences between this patchset and Oren's patchset are: > > > > * In this patchset checkpointing initiated not from the process > > (right now we do not have a container, only namespaces), Oren's patchset > > performs checkpointing from the process context. > > > > * Restart in this patchset is initiated from process, which restarts a > > new process (in new namespaces) with saved state. Oren's patchset uses > > the same process from which restart was initiated and restore saved state > > over it. > > > > * Checkpoint/restart functionality in this patchset is implemented as a > > kernel module > > why ? Do we really think that C/R implementations will be so different that > we will need C/R ops to support all of them ? I imagine that there could be > different models : > > 1. brute force : dump it all and kill > 2. incremental > 3. live migration > ... > > But I see all of them really tied to the kernel internals. > > The first issues I see with this direction are some EXPORT_SYMBOL() that > would be useless without a module. Checkpoint/restart functionality is implemented as a kernel module to provide more flexibility during development process - you can easily unload module, fix what you need and load module again, you do not need to reboot you server in this case. We have discussed this question during Container's mini summit on OLS-2008 and we have found out that it will be convenient to have such an ability. > > > As checkpointing is initiated not from the process which state should be > > saved we should freeze a process before saving its state. Right now > > Container Freezer from Matt Helsley can be used for this. > > OK that's integrated and Daniel's tools : > > http://lxc.cvs.sourceforge.net/lxc/ > > one more reason to work on integration :) Yes, you're right :) Regards, Andrey > > > This patchset introduce only a concept how kernel based > > checkpointing/restart can be implemented and are able to > > checkpoint/restart only a single process with simple VMAs. > > > > I've tried to split my patchset in small patches to make review more > > easier. _______________________________________________ > > Containers mailing list > > Containers@lists.linux-foundation.org > > https://lists.linux-foundation.org/mailman/listinfo/containers > > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/containers > > _______________________________________________ > Devel mailing list > Devel@openvz.org > https://openvz.org/mailman/listinfo/devel ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-09-03 13:59 ` [Devel] " Andrey Mirkin @ 2008-09-04 22:55 ` Dave Hansen 0 siblings, 0 replies; 69+ messages in thread From: Dave Hansen @ 2008-09-04 22:55 UTC (permalink / raw) To: Andrey Mirkin; +Cc: devel, containers, Cedric Le Goater, linux-kernel On Wed, 2008-09-03 at 17:59 +0400, Andrey Mirkin wrote: > > The first issues I see with this direction are some EXPORT_SYMBOL() that > > would be useless without a module. > > Checkpoint/restart functionality is implemented as a kernel module to provide > more flexibility during development process - you can easily unload module, > fix what you need and load module again, you do not need to reboot you server > in this case. We have discussed this question during Container's mini summit > on OLS-2008 and we have found out that it will be convenient to have such an > ability. It would probably be best to separate out the bits that are needed to make this feature build as a module. Since, as you say, this is only useful during development, we'll have to rip these bits out eventually to get it merged. Might as well do that from the start. -- Dave ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-09-03 12:28 ` [PATCH 0/9] OpenVZ kernel based checkpointing/restart Cedric Le Goater 2008-09-03 13:59 ` [Devel] " Andrey Mirkin @ 2008-09-03 14:18 ` Serge E. Hallyn 1 sibling, 0 replies; 69+ messages in thread From: Serge E. Hallyn @ 2008-09-03 14:18 UTC (permalink / raw) To: Cedric Le Goater; +Cc: Andrey Mirkin, containers, linux-kernel Quoting Cedric Le Goater (clg@fr.ibm.com): > Andrey Mirkin wrote: > > This patchset introduces kernel based checkpointing/restart as it is > > implemented in OpenVZ project. This patchset has limited functionality and > > are able to checkpoint/restart only single process. Recently Oren Laaden > > sent another kernel based implementation of checkpoint/restart. The main > > differences between this patchset and Oren's patchset are: > > > > * In this patchset checkpointing initiated not from the process > > (right now we do not have a container, only namespaces), Oren's patchset > > performs checkpointing from the process context. > > > > * Restart in this patchset is initiated from process, which restarts a new > > process (in new namespaces) with saved state. Oren's patchset uses the same > > process from which restart was initiated and restore saved state over it. > > > > * Checkpoint/restart functionality in this patchset is implemented as a kernel > > module > > why ? Do we really think that C/R implementations will be so different that > we will need C/R ops to support all of them ? I imagine that there could be > different models : At the mini-summit two reasons were brought up to make it a module: 1. So sysadmins worried about security implications can completely unload the module 2. So developers can unload and reload the module while testing. > 1. brute force : dump it all and kill > 2. incremental > 3. live migration > ... Actually I don't think we expected to use different implementations for those. > But I see all of them really tied to the kernel internals. > > The first issues I see with this direction are some EXPORT_SYMBOL() that would > be useless without a module. > > > As checkpointing is initiated not from the process which state should be saved > > we should freeze a process before saving its state. Right now Container Freezer > > from Matt Helsley can be used for this. > > OK that's integrated and Daniel's tools : > > http://lxc.cvs.sourceforge.net/lxc/ > > one more reason to work on integration :) > > C. > > > > This patchset introduce only a concept how kernel based checkpointing/restart > > can be implemented and are able to checkpoint/restart only a single process > > with simple VMAs. > > > > I've tried to split my patchset in small patches to make review more easier. > > _______________________________________________ > > Containers mailing list > > Containers@lists.linux-foundation.org > > https://lists.linux-foundation.org/mailman/listinfo/containers > > > > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-09-03 10:57 [PATCH 0/9] OpenVZ kernel based checkpointing/restart Andrey Mirkin 2008-09-03 10:57 ` [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls Andrey Mirkin 2008-09-03 12:28 ` [PATCH 0/9] OpenVZ kernel based checkpointing/restart Cedric Le Goater @ 2008-09-03 13:49 ` Louis Rilling 2008-09-03 14:06 ` Louis Rilling 2008-09-04 8:14 ` Oren Laadan ` (2 subsequent siblings) 5 siblings, 1 reply; 69+ messages in thread From: Louis Rilling @ 2008-09-03 13:49 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers [-- Attachment #1: Type: text/plain, Size: 1792 bytes --] On Wed, Sep 03, 2008 at 02:57:47PM +0400, Andrey Mirkin wrote: > This patchset introduces kernel based checkpointing/restart as it is > implemented in OpenVZ project. This patchset has limited functionality and > are able to checkpoint/restart only single process. Recently Oren Laaden > sent another kernel based implementation of checkpoint/restart. The main > differences between this patchset and Oren's patchset are: > > * In this patchset checkpointing initiated not from the process > (right now we do not have a container, only namespaces), Oren's patchset > performs checkpointing from the process context. > > * Restart in this patchset is initiated from process, which restarts a new > process (in new namespaces) with saved state. Oren's patchset uses the same > process from which restart was initiated and restore saved state over it. > > * Checkpoint/restart functionality in this patchset is implemented as a kernel > module > > > As checkpointing is initiated not from the process which state should be saved > we should freeze a process before saving its state. Right now Container Freezer > from Matt Helsley can be used for this. > > This patchset introduce only a concept how kernel based checkpointing/restart > can be implemented and are able to checkpoint/restart only a single process > with simple VMAs. > > I've tried to split my patchset in small patches to make review more easier. Thank you Andrey for having highlighted the differences with Oren's approach, and for having split this patchset. Few remarks in reply to the patches. Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-09-03 13:49 ` Louis Rilling @ 2008-09-03 14:06 ` Louis Rilling 2008-09-03 14:19 ` Andrey Mirkin 2008-09-03 14:26 ` Cedric Le Goater 0 siblings, 2 replies; 69+ messages in thread From: Louis Rilling @ 2008-09-03 14:06 UTC (permalink / raw) To: Andrey Mirkin; +Cc: containers, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2507 bytes --] On Wed, Sep 03, 2008 at 03:49:51PM +0200, Louis Rilling wrote: > On Wed, Sep 03, 2008 at 02:57:47PM +0400, Andrey Mirkin wrote: > > This patchset introduces kernel based checkpointing/restart as it is > > implemented in OpenVZ project. This patchset has limited functionality and > > are able to checkpoint/restart only single process. Recently Oren Laaden > > sent another kernel based implementation of checkpoint/restart. The main > > differences between this patchset and Oren's patchset are: > > > > * In this patchset checkpointing initiated not from the process > > (right now we do not have a container, only namespaces), Oren's patchset > > performs checkpointing from the process context. > > > > * Restart in this patchset is initiated from process, which restarts a new > > process (in new namespaces) with saved state. Oren's patchset uses the same > > process from which restart was initiated and restore saved state over it. > > > > * Checkpoint/restart functionality in this patchset is implemented as a kernel > > module > > > > > > As checkpointing is initiated not from the process which state should be saved > > we should freeze a process before saving its state. Right now Container Freezer > > from Matt Helsley can be used for this. > > > > This patchset introduce only a concept how kernel based checkpointing/restart > > can be implemented and are able to checkpoint/restart only a single process > > with simple VMAs. > > > > I've tried to split my patchset in small patches to make review more easier. > > Thank you Andrey for having highlighted the differences with Oren's approach, > and for having split this patchset. Few remarks in reply to the patches. Forgot a global comment: you will probably get the same (rather pointless for a proof of concept, IMHO) requests as Oren got, to 1) improve coding style: a) especially avoid error handling like: err = foo(); if (!err) err = bar(); if (!err) err = baz(); and prefer err = foo(); if (err) goto foo_err; err = bar(); ... b) do not write conditions on a single line, like if (foo) bar; 2) put arch-dependent code in arch/ subdirs. 3) I probably forgot other ones. I obviously do not personally request you to take these requests into account ;) Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-09-03 14:06 ` Louis Rilling @ 2008-09-03 14:19 ` Andrey Mirkin 2008-09-03 14:26 ` Cedric Le Goater 1 sibling, 0 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 14:19 UTC (permalink / raw) To: Louis.Rilling; +Cc: containers, linux-kernel On Wednesday 03 September 2008 18:06 Louis Rilling wrote: > On Wed, Sep 03, 2008 at 03:49:51PM +0200, Louis Rilling wrote: > > On Wed, Sep 03, 2008 at 02:57:47PM +0400, Andrey Mirkin wrote: > > > This patchset introduces kernel based checkpointing/restart as it is > > > implemented in OpenVZ project. This patchset has limited functionality > > > and are able to checkpoint/restart only single process. Recently Oren > > > Laaden sent another kernel based implementation of checkpoint/restart. > > > The main differences between this patchset and Oren's patchset are: > > > > > > * In this patchset checkpointing initiated not from the process > > > (right now we do not have a container, only namespaces), Oren's > > > patchset performs checkpointing from the process context. > > > > > > * Restart in this patchset is initiated from process, which restarts a > > > new process (in new namespaces) with saved state. Oren's patchset uses > > > the same process from which restart was initiated and restore saved > > > state over it. > > > > > > * Checkpoint/restart functionality in this patchset is implemented as a > > > kernel module > > > > > > > > > As checkpointing is initiated not from the process which state should > > > be saved we should freeze a process before saving its state. Right now > > > Container Freezer from Matt Helsley can be used for this. > > > > > > This patchset introduce only a concept how kernel based > > > checkpointing/restart can be implemented and are able to > > > checkpoint/restart only a single process with simple VMAs. > > > > > > I've tried to split my patchset in small patches to make review more > > > easier. > > > > Thank you Andrey for having highlighted the differences with Oren's > > approach, and for having split this patchset. Few remarks in reply to the > > patches. > > Forgot a global comment: you will probably get the same (rather pointless > for a proof of concept, IMHO) requests as Oren got, to > 1) improve coding style: > a) especially avoid error handling like: > err = foo(); > if (!err) > err = bar(); > if (!err) > err = baz(); > > and prefer > err = foo(); > if (err) > goto foo_err; > err = bar(); > ... > > b) do not write conditions on a single line, like > if (foo) bar; > > 2) put arch-dependent code in arch/ subdirs. > 3) I probably forgot other ones. > > I obviously do not personally request you to take these requests into > account ;) Thanks for comments. Unfortunately I have missed 2-3 weeks of discussion of Oren's patches. Right now I'm reading all threads devoted to checkpointing. I'll take into account you comments and I'll try to fix all comments in next version. Thanks, Andrey ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-09-03 14:06 ` Louis Rilling 2008-09-03 14:19 ` Andrey Mirkin @ 2008-09-03 14:26 ` Cedric Le Goater 2008-09-03 14:53 ` Andrey Mirkin 1 sibling, 1 reply; 69+ messages in thread From: Cedric Le Goater @ 2008-09-03 14:26 UTC (permalink / raw) To: Louis.Rilling; +Cc: Andrey Mirkin, containers, linux-kernel > Forgot a global comment: you will probably get the same (rather pointless for a > proof of concept, IMHO) requests as Oren got, to > 1) improve coding style: > a) especially avoid error handling like: > err = foo(); > if (!err) > err = bar(); > if (!err) > err = baz(); > > and prefer > err = foo(); > if (err) > goto foo_err; > err = bar(); > ... > > b) do not write conditions on a single line, like > if (foo) bar; > > 2) put arch-dependent code in arch/ subdirs. > 3) I probably forgot other ones. ./scripts/checkpatch.pl catches most of it (and it does catch a few on your patches) but Dave is even better ! :) Cheers, C. > I obviously do not personally request you to take these requests into account ;) ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-09-03 14:26 ` Cedric Le Goater @ 2008-09-03 14:53 ` Andrey Mirkin 0 siblings, 0 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-09-03 14:53 UTC (permalink / raw) To: Cedric Le Goater; +Cc: Louis.Rilling, containers, linux-kernel On Wednesday 03 September 2008 18:26 Cedric Le Goater wrote: > > Forgot a global comment: you will probably get the same (rather pointless > > for a proof of concept, IMHO) requests as Oren got, to > > 1) improve coding style: > > a) especially avoid error handling like: > > err = foo(); > > if (!err) > > err = bar(); > > if (!err) > > err = baz(); > > > > and prefer > > err = foo(); > > if (err) > > goto foo_err; > > err = bar(); > > ... > > > > b) do not write conditions on a single line, like > > if (foo) bar; > > > > 2) put arch-dependent code in arch/ subdirs. > > 3) I probably forgot other ones. > > ./scripts/checkpatch.pl catches most of it (and it does catch a few on your > patches) Thanks, I will use this script to brush up the code. > > but Dave is even better ! :) I'm waiting for his comments :) Regards, Andrey > > > I obviously do not personally request you to take these requests into > > account ;) > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-09-03 10:57 [PATCH 0/9] OpenVZ kernel based checkpointing/restart Andrey Mirkin ` (2 preceding siblings ...) 2008-09-03 13:49 ` Louis Rilling @ 2008-09-04 8:14 ` Oren Laadan 2008-09-04 14:05 ` Dave Hansen 2008-10-17 23:33 ` Dave Hansen 5 siblings, 0 replies; 69+ messages in thread From: Oren Laadan @ 2008-09-04 8:14 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers Andrey Mirkin wrote: > This patchset introduces kernel based checkpointing/restart as it is > implemented in OpenVZ project. This patchset has limited functionality and > are able to checkpoint/restart only single process. Recently Oren Laaden > sent another kernel based implementation of checkpoint/restart. The main > differences between this patchset and Oren's patchset are: Thanks Andrey. I'll take a close look at it by the weekend. Just a couple of comments after a very brief look at the code: > > * In this patchset checkpointing initiated not from the process > (right now we do not have a container, only namespaces), Oren's patchset > performs checkpointing from the process context. Actually, my patch is written to checkpoint from outside the context of the target task; for simplicity only, it is applied to the current process. > > * Restart in this patchset is initiated from process, which restarts a new > process (in new namespaces) with saved state. Oren's patchset uses the same > process from which restart was initiated and restore saved state over it. Here, too, handling of namespaces was left out in order to provide a simple proof of concept. In fact, spawning a new process and entering a new namespace can be easily done in user-space. Oren. > > * Checkpoint/restart functionality in this patchset is implemented as a kernel > module > > > As checkpointing is initiated not from the process which state should be saved > we should freeze a process before saving its state. Right now Container Freezer > from Matt Helsley can be used for this. > > This patchset introduce only a concept how kernel based checkpointing/restart > can be implemented and are able to checkpoint/restart only a single process > with simple VMAs. > > I've tried to split my patchset in small patches to make review more easier. > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-09-03 10:57 [PATCH 0/9] OpenVZ kernel based checkpointing/restart Andrey Mirkin ` (3 preceding siblings ...) 2008-09-04 8:14 ` Oren Laadan @ 2008-09-04 14:05 ` Dave Hansen 2008-10-17 23:33 ` Dave Hansen 5 siblings, 0 replies; 69+ messages in thread From: Dave Hansen @ 2008-09-04 14:05 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers So, just like the I/O controller patches, we surely can't just throw patch sets back and forth at each other. We're also sure to wear out any potential reviewers, especially on LKML. The differences you've described between this and Oren's patches are pretty small, all things considered. Would you be willing to simply help with Oren's patches or port things over from this set instead of working on a completely disjoint set? -- Dave ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-09-03 10:57 [PATCH 0/9] OpenVZ kernel based checkpointing/restart Andrey Mirkin ` (4 preceding siblings ...) 2008-09-04 14:05 ` Dave Hansen @ 2008-10-17 23:33 ` Dave Hansen 2008-10-20 11:10 ` Louis Rilling 2008-10-20 12:14 ` [Devel] " Andrey Mirkin 5 siblings, 2 replies; 69+ messages in thread From: Dave Hansen @ 2008-10-17 23:33 UTC (permalink / raw) To: Andrey Mirkin; +Cc: linux-kernel, containers On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: > This patchset introduces kernel based checkpointing/restart as it is > implemented in OpenVZ project. This patchset has limited functionality and > are able to checkpoint/restart only single process. Recently Oren Laaden > sent another kernel based implementation of checkpoint/restart. The main > differences between this patchset and Oren's patchset are: Hi Andrey, I'm curious what you want to happen with this patch set. Is there something specific in Oren's set that deficient which you need implemented? Are there some technical reasons you prefer this code? -- Dave ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-17 23:33 ` Dave Hansen @ 2008-10-20 11:10 ` Louis Rilling 2008-10-20 13:25 ` Daniel Lezcano 2008-10-20 16:36 ` Dave Hansen 2008-10-20 12:14 ` [Devel] " Andrey Mirkin 1 sibling, 2 replies; 69+ messages in thread From: Louis Rilling @ 2008-10-20 11:10 UTC (permalink / raw) To: Dave Hansen; +Cc: Andrey Mirkin, linux-kernel, containers, Oren Laadan [-- Attachment #1: Type: text/plain, Size: 1159 bytes --] On Fri, Oct 17, 2008 at 04:33:03PM -0700, Dave Hansen wrote: > On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: > > This patchset introduces kernel based checkpointing/restart as it is > > implemented in OpenVZ project. This patchset has limited functionality and > > are able to checkpoint/restart only single process. Recently Oren Laaden > > sent another kernel based implementation of checkpoint/restart. The main > > differences between this patchset and Oren's patchset are: > > Hi Andrey, > > I'm curious what you want to happen with this patch set. Is there > something specific in Oren's set that deficient which you need > implemented? Are there some technical reasons you prefer this code? To be fair, and since (IIRC) the initial intent was to start with OpenVZ's approach, shouldn't Oren answer the same questions with respect to Andrey's patchset? I'm afraid that we are forgetting to take the best from both approaches... Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 11:10 ` Louis Rilling @ 2008-10-20 13:25 ` Daniel Lezcano 2008-10-20 13:48 ` Cedric Le Goater 2008-10-20 15:53 ` Oren Laadan 2008-10-20 16:36 ` Dave Hansen 1 sibling, 2 replies; 69+ messages in thread From: Daniel Lezcano @ 2008-10-20 13:25 UTC (permalink / raw) To: Louis.Rilling; +Cc: Dave Hansen, containers, Andrey Mirkin, linux-kernel Louis Rilling wrote: > On Fri, Oct 17, 2008 at 04:33:03PM -0700, Dave Hansen wrote: >> On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: >>> This patchset introduces kernel based checkpointing/restart as it is >>> implemented in OpenVZ project. This patchset has limited functionality and >>> are able to checkpoint/restart only single process. Recently Oren Laaden >>> sent another kernel based implementation of checkpoint/restart. The main >>> differences between this patchset and Oren's patchset are: >> Hi Andrey, >> >> I'm curious what you want to happen with this patch set. Is there >> something specific in Oren's set that deficient which you need >> implemented? Are there some technical reasons you prefer this code? > > To be fair, and since (IIRC) the initial intent was to start with OpenVZ's > approach, shouldn't Oren answer the same questions with respect to Andrey's > patchset? > > I'm afraid that we are forgetting to take the best from both approaches... I agree with Louis. I played with Oren's patchset and tryed to port it on x86_64. I was able to sys_checkpoint/sys_restart but if you remove the restoring of the general registers, the restart still works. I am not an expert on asm, but my hypothesis is when we call sys_checkpoint the registers are saved on the stack by the syscall and when we restore the memory of the process, we restore the stack and the stacked registers are restored when exiting the sys_restart. That make me feel there is an important gap between external checkpoint and internal checkpoint. Dmitry's patchset is nice too, but IMO, it goes too far from what we decided to do at the container mini-summit. I think there are a lot of design questions to be solved before going further. IMHO we should look at Dmitry patchset and merge the external checkpoint code to Oren's patchset in order to checkpoint *one* process and have the process to restart itself. At this point, we can begin to talk about the restart itself, shall we have the kernel to fork the processes to be restarted ? shall we fork from userspace and implement some mechanism to have each processes to restart themselves ? etc... ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 13:25 ` Daniel Lezcano @ 2008-10-20 13:48 ` Cedric Le Goater 2008-10-20 13:49 ` Daniel Lezcano 2008-10-20 15:53 ` Oren Laadan 1 sibling, 1 reply; 69+ messages in thread From: Cedric Le Goater @ 2008-10-20 13:48 UTC (permalink / raw) To: Daniel Lezcano Cc: Louis.Rilling, Dave Hansen, containers, Andrey Mirkin, linux-kernel >> I'm afraid that we are forgetting to take the best from both >> approaches... > > I agree with Louis. > > I played with Oren's patchset and tryed to port it on x86_64. I was able > to sys_checkpoint/sys_restart but if you remove the restoring of the > general registers, the restart still works. I am not an expert on asm, > but my hypothesis is when we call sys_checkpoint the registers are saved > on the stack by the syscall and when we restore the memory of the > process, we restore the stack and the stacked registers are restored > when exiting the sys_restart. That make me feel there is an important > gap between external checkpoint and internal checkpoint. > > Dmitry's patchset is nice too, but IMO, it goes too far from what we I think you are talking about Andrey. C. > decided to do at the container mini-summit. I think there are a lot of > design questions to be solved before going further. > > IMHO we should look at Dmitry patchset and merge the external checkpoint > code to Oren's patchset in order to checkpoint *one* process and have > the process to restart itself. At this point, we can begin to talk about > the restart itself, shall we have the kernel to fork the processes to be > restarted ? shall we fork from userspace and implement some mechanism to > have each processes to restart themselves ? etc... ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 13:48 ` Cedric Le Goater @ 2008-10-20 13:49 ` Daniel Lezcano 0 siblings, 0 replies; 69+ messages in thread From: Daniel Lezcano @ 2008-10-20 13:49 UTC (permalink / raw) To: Cedric Le Goater Cc: Louis.Rilling, Dave Hansen, containers, Andrey Mirkin, linux-kernel Cedric Le Goater wrote: >>> I'm afraid that we are forgetting to take the best from both >>> approaches... >> I agree with Louis. >> >> I played with Oren's patchset and tryed to port it on x86_64. I was able >> to sys_checkpoint/sys_restart but if you remove the restoring of the >> general registers, the restart still works. I am not an expert on asm, >> but my hypothesis is when we call sys_checkpoint the registers are saved >> on the stack by the syscall and when we restore the memory of the >> process, we restore the stack and the stacked registers are restored >> when exiting the sys_restart. That make me feel there is an important >> gap between external checkpoint and internal checkpoint. >> >> Dmitry's patchset is nice too, but IMO, it goes too far from what we > > I think you are talking about Andrey. Yes :) ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 13:25 ` Daniel Lezcano 2008-10-20 13:48 ` Cedric Le Goater @ 2008-10-20 15:53 ` Oren Laadan 2008-10-20 16:37 ` Daniel Lezcano ` (2 more replies) 1 sibling, 3 replies; 69+ messages in thread From: Oren Laadan @ 2008-10-20 15:53 UTC (permalink / raw) To: Daniel Lezcano Cc: Louis.Rilling, linux-kernel, containers, Andrey Mirkin, Dave Hansen Daniel Lezcano wrote: > Louis Rilling wrote: >> On Fri, Oct 17, 2008 at 04:33:03PM -0700, Dave Hansen wrote: >>> On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: >>>> This patchset introduces kernel based checkpointing/restart as it is >>>> implemented in OpenVZ project. This patchset has limited functionality and >>>> are able to checkpoint/restart only single process. Recently Oren Laaden >>>> sent another kernel based implementation of checkpoint/restart. The main >>>> differences between this patchset and Oren's patchset are: >>> Hi Andrey, >>> >>> I'm curious what you want to happen with this patch set. Is there >>> something specific in Oren's set that deficient which you need >>> implemented? Are there some technical reasons you prefer this code? >> To be fair, and since (IIRC) the initial intent was to start with OpenVZ's >> approach, shouldn't Oren answer the same questions with respect to Andrey's >> patchset? >> >> I'm afraid that we are forgetting to take the best from both approaches... > > I agree with Louis. > > I played with Oren's patchset and tryed to port it on x86_64. I was able > to sys_checkpoint/sys_restart but if you remove the restoring of the > general registers, the restart still works. I am not an expert on asm, > but my hypothesis is when we call sys_checkpoint the registers are saved > on the stack by the syscall and when we restore the memory of the > process, we restore the stack and the stacked registers are restored > when exiting the sys_restart. That make me feel there is an important > gap between external checkpoint and internal checkpoint. This is a misconception: my patches are not "internal checkpoint". My patches are basically "external checkpoint" by design, which *also* accommodates self-checkpointing (aka internal). The same holds for the restart. The implementation is demonstrated with "self-checkpoint" to avoid complicating things at this early stage of proof-of-concept. For multiple processes all that is needed is a container and a loop on the checkpoint side, and a method to recreate processes on the restart side. Andrew suggests to do it in kernel space, I still have doubts. While I held out the multi-process part of the patch so far because I was explicitly asked to do it, it seems like this would be a good time to push it out and get feedback. > > Dmitry's patchset is nice too, but IMO, it goes too far from what we > decided to do at the container mini-summit. I think there are a lot of > design questions to be solved before going further. > > IMHO we should look at Dmitry patchset and merge the external checkpoint > code to Oren's patchset in order to checkpoint *one* process and have > the process to restart itself. At this point, we can begin to talk about > the restart itself, shall we have the kernel to fork the processes to be > restarted ? shall we fork from userspace and implement some mechanism to > have each processes to restart themselves ? etc... > In both approaches, processes restart themselves, in the sense that a process to be restarted eventually calls "do_restart()" (or equivalent). The only question is how processes are created. Andrew's patch creates everything inside the kernel. I would like to still give it a try outside the kernel. Everything is ready, except that we need a way to pre-select a PID for the new child... we never agreed on that one, did we ? If we go ahead with the kernel-based process creation, it's easy to merge it to the current patch-set. Oren. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 15:53 ` Oren Laadan @ 2008-10-20 16:37 ` Daniel Lezcano 2008-10-20 17:23 ` Serge E. Hallyn 2008-10-27 14:45 ` [Devel] " Andrey Mirkin 2008-10-20 16:51 ` Serge E. Hallyn 2008-10-21 9:36 ` Cedric Le Goater 2 siblings, 2 replies; 69+ messages in thread From: Daniel Lezcano @ 2008-10-20 16:37 UTC (permalink / raw) To: Oren Laadan Cc: Louis.Rilling, linux-kernel, containers, Andrey Mirkin, Dave Hansen Oren Laadan wrote: > > Daniel Lezcano wrote: >> Louis Rilling wrote: >>> On Fri, Oct 17, 2008 at 04:33:03PM -0700, Dave Hansen wrote: >>>> On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: >>>>> This patchset introduces kernel based checkpointing/restart as it is >>>>> implemented in OpenVZ project. This patchset has limited functionality and >>>>> are able to checkpoint/restart only single process. Recently Oren Laaden >>>>> sent another kernel based implementation of checkpoint/restart. The main >>>>> differences between this patchset and Oren's patchset are: >>>> Hi Andrey, >>>> >>>> I'm curious what you want to happen with this patch set. Is there >>>> something specific in Oren's set that deficient which you need >>>> implemented? Are there some technical reasons you prefer this code? >>> To be fair, and since (IIRC) the initial intent was to start with OpenVZ's >>> approach, shouldn't Oren answer the same questions with respect to Andrey's >>> patchset? >>> >>> I'm afraid that we are forgetting to take the best from both approaches... >> I agree with Louis. >> >> I played with Oren's patchset and tryed to port it on x86_64. I was able >> to sys_checkpoint/sys_restart but if you remove the restoring of the >> general registers, the restart still works. I am not an expert on asm, >> but my hypothesis is when we call sys_checkpoint the registers are saved >> on the stack by the syscall and when we restore the memory of the >> process, we restore the stack and the stacked registers are restored >> when exiting the sys_restart. That make me feel there is an important >> gap between external checkpoint and internal checkpoint. > > This is a misconception: my patches are not "internal checkpoint". My > patches are basically "external checkpoint" by design, which *also* > accommodates self-checkpointing (aka internal). The same holds for the > restart. The implementation is demonstrated with "self-checkpoint" to > avoid complicating things at this early stage of proof-of-concept. Yep, I read your patchset :) I just want to clarify what we want to demonstrate with this patchset for the proof-of-concept ? A self CR does not show what are the complicate parts of the CR, we are just showing we can dump the memory from the kernel and do setcontext/getcontext. We state at the container mini-summit on an approach: 1. Pre-dump 2. Freeze the container 3. Dump 4. Thaw/Kill the container 5. Post-dump We already have the freezer, and we can forget for now pre-dump and post-dump. IMHO, for the proof-of-concept we should do a minimal CR (like you did), but conforming with these 5 points, but that means we have to do an external checkpoint. If the POC conforms with that, the patchset will be a little different and that will show what are the difficult part for restarting a process, especially to restart it at the frozen state :) and that will give an idea from 10000 feets of the big picture. > For multiple processes all that is needed is a container and a loop > on the checkpoint side, and a method to recreate processes on the > restart side. Andrew suggests to do it in kernel space, I still have > doubts. A question to Andrey, do you, in OpenVZ, restart "externally" or it is the first process of the pid namespace which calls sys_restart and then populates the pid namespace ? > While I held out the multi-process part of the patch so far because I > was explicitly asked to do it, it seems like this would be a good time > to push it out and get feedback. IMHO it is too soon... ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 16:37 ` Daniel Lezcano @ 2008-10-20 17:23 ` Serge E. Hallyn 2008-10-21 0:18 ` Oren Laadan 2008-10-27 14:45 ` [Devel] " Andrey Mirkin 1 sibling, 1 reply; 69+ messages in thread From: Serge E. Hallyn @ 2008-10-20 17:23 UTC (permalink / raw) To: Daniel Lezcano Cc: Oren Laadan, Louis.Rilling, containers, Dave Hansen, linux-kernel, Andrey Mirkin Quoting Daniel Lezcano (dlezcano@fr.ibm.com): > Oren Laadan wrote: > > > > Daniel Lezcano wrote: > >> Louis Rilling wrote: > >>> On Fri, Oct 17, 2008 at 04:33:03PM -0700, Dave Hansen wrote: > >>>> On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: > >>>>> This patchset introduces kernel based checkpointing/restart as it is > >>>>> implemented in OpenVZ project. This patchset has limited functionality and > >>>>> are able to checkpoint/restart only single process. Recently Oren Laaden > >>>>> sent another kernel based implementation of checkpoint/restart. The main > >>>>> differences between this patchset and Oren's patchset are: > >>>> Hi Andrey, > >>>> > >>>> I'm curious what you want to happen with this patch set. Is there > >>>> something specific in Oren's set that deficient which you need > >>>> implemented? Are there some technical reasons you prefer this code? > >>> To be fair, and since (IIRC) the initial intent was to start with OpenVZ's > >>> approach, shouldn't Oren answer the same questions with respect to Andrey's > >>> patchset? > >>> > >>> I'm afraid that we are forgetting to take the best from both approaches... > >> I agree with Louis. > >> > >> I played with Oren's patchset and tryed to port it on x86_64. I was able > >> to sys_checkpoint/sys_restart but if you remove the restoring of the > >> general registers, the restart still works. I am not an expert on asm, > >> but my hypothesis is when we call sys_checkpoint the registers are saved > >> on the stack by the syscall and when we restore the memory of the > >> process, we restore the stack and the stacked registers are restored > >> when exiting the sys_restart. That make me feel there is an important > >> gap between external checkpoint and internal checkpoint. > > > > This is a misconception: my patches are not "internal checkpoint". My > > patches are basically "external checkpoint" by design, which *also* > > accommodates self-checkpointing (aka internal). The same holds for the > > restart. The implementation is demonstrated with "self-checkpoint" to > > avoid complicating things at this early stage of proof-of-concept. > > Yep, I read your patchset :) > > I just want to clarify what we want to demonstrate with this patchset > for the proof-of-concept ? A self CR does not show what are the > complicate parts of the CR, we are just showing we can dump the memory > from the kernel and do setcontext/getcontext. > > We state at the container mini-summit on an approach: > > 1. Pre-dump > 2. Freeze the container > 3. Dump > 4. Thaw/Kill the container > 5. Post-dump > > We already have the freezer, and we can forget for now pre-dump and > post-dump. > > IMHO, for the proof-of-concept we should do a minimal CR (like you did), > but conforming with these 5 points, but that means we have to do an > external checkpoint. Right, Oren, iiuc you are insisting that 'external checkpoint' and 'multiple task checkpoint' are the same thing. But they aren't. Rather, I think that what we say is 'multiple tasks c/r' is what you say should be done from user-space :) So particularly given that your patchset seems to be in good shape, I'd like to see external checkpoint explicitly supported. Please just call me a dunce if v7 already works for that. thanks, -serge ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 17:23 ` Serge E. Hallyn @ 2008-10-21 0:18 ` Oren Laadan 2008-10-21 0:58 ` Serge E. Hallyn 2008-10-21 13:24 ` Daniel Lezcano 0 siblings, 2 replies; 69+ messages in thread From: Oren Laadan @ 2008-10-21 0:18 UTC (permalink / raw) To: Serge E. Hallyn Cc: Daniel Lezcano, Louis.Rilling, containers, Dave Hansen, linux-kernel, Andrey Mirkin Serge E. Hallyn wrote: > Quoting Daniel Lezcano (dlezcano@fr.ibm.com): >> Oren Laadan wrote: >>> Daniel Lezcano wrote: >>>> Louis Rilling wrote: >>>>> On Fri, Oct 17, 2008 at 04:33:03PM -0700, Dave Hansen wrote: >>>>>> On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: >>>>>>> This patchset introduces kernel based checkpointing/restart as it is >>>>>>> implemented in OpenVZ project. This patchset has limited functionality and >>>>>>> are able to checkpoint/restart only single process. Recently Oren Laaden >>>>>>> sent another kernel based implementation of checkpoint/restart. The main >>>>>>> differences between this patchset and Oren's patchset are: >>>>>> Hi Andrey, >>>>>> >>>>>> I'm curious what you want to happen with this patch set. Is there >>>>>> something specific in Oren's set that deficient which you need >>>>>> implemented? Are there some technical reasons you prefer this code? >>>>> To be fair, and since (IIRC) the initial intent was to start with OpenVZ's >>>>> approach, shouldn't Oren answer the same questions with respect to Andrey's >>>>> patchset? >>>>> >>>>> I'm afraid that we are forgetting to take the best from both approaches... >>>> I agree with Louis. >>>> >>>> I played with Oren's patchset and tryed to port it on x86_64. I was able >>>> to sys_checkpoint/sys_restart but if you remove the restoring of the >>>> general registers, the restart still works. I am not an expert on asm, >>>> but my hypothesis is when we call sys_checkpoint the registers are saved >>>> on the stack by the syscall and when we restore the memory of the >>>> process, we restore the stack and the stacked registers are restored >>>> when exiting the sys_restart. That make me feel there is an important >>>> gap between external checkpoint and internal checkpoint. >>> This is a misconception: my patches are not "internal checkpoint". My >>> patches are basically "external checkpoint" by design, which *also* >>> accommodates self-checkpointing (aka internal). The same holds for the >>> restart. The implementation is demonstrated with "self-checkpoint" to >>> avoid complicating things at this early stage of proof-of-concept. >> Yep, I read your patchset :) >> >> I just want to clarify what we want to demonstrate with this patchset >> for the proof-of-concept ? A self CR does not show what are the >> complicate parts of the CR, we are just showing we can dump the memory >> from the kernel and do setcontext/getcontext. >> >> We state at the container mini-summit on an approach: >> >> 1. Pre-dump >> 2. Freeze the container >> 3. Dump >> 4. Thaw/Kill the container >> 5. Post-dump >> >> We already have the freezer, and we can forget for now pre-dump and >> post-dump. >> >> IMHO, for the proof-of-concept we should do a minimal CR (like you did), >> but conforming with these 5 points, but that means we have to do an >> external checkpoint. > > Right, Oren, iiuc you are insisting that 'external checkpoint' and > 'multiple task checkpoint' are the same thing. But they aren't. > Rather, I think that what we say is 'multiple tasks c/r' is what you say > should be done from user-space :) Then I don't explain myself clearly :) The only thing I consider doing in user space is the creation of the container, the namespaces and the processes. I argue that "external checkpoint of a single process" is very few lines of code away from "self checkpoint" that is in v7. I'm not sure how you define "external restart" ? eventually, the processes restart themselves. It is a question of how the processes are created to begin with. > > So particularly given that your patchset seems to be in good shape, > I'd like to see external checkpoint explicitly supported. Please > just call me a dunce if v7 already works for that. > It seems like you want a single process to checkpoint a single (other) process, and then a single process to start a single (other) process. I tried to explicitly avoid dealing with the container (user space ? kernel space ?) and with creating new processes (user space ? kernel space ?). Nevertheless, it's the _same_ code. All that is needed is to make the checkpoint syscall refer to the other task instead of self, and the restart should create a container and fork there, then call sys_restart. I guess instead of repeating this argument over, I will go ahead and post a patch on top of v7 to demonstrate this (without a container, therefore without preserving the original pid). Oren. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-21 0:18 ` Oren Laadan @ 2008-10-21 0:58 ` Serge E. Hallyn 2008-10-21 13:24 ` Daniel Lezcano 1 sibling, 0 replies; 69+ messages in thread From: Serge E. Hallyn @ 2008-10-21 0:58 UTC (permalink / raw) To: Oren Laadan Cc: Daniel Lezcano, Louis.Rilling, containers, Dave Hansen, linux-kernel, Andrey Mirkin Quoting Oren Laadan (orenl@cs.columbia.edu): > > > Serge E. Hallyn wrote: > > Quoting Daniel Lezcano (dlezcano@fr.ibm.com): > >> Oren Laadan wrote: > >>> Daniel Lezcano wrote: > >>>> Louis Rilling wrote: > >>>>> On Fri, Oct 17, 2008 at 04:33:03PM -0700, Dave Hansen wrote: > >>>>>> On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: > >>>>>>> This patchset introduces kernel based checkpointing/restart as it is > >>>>>>> implemented in OpenVZ project. This patchset has limited functionality and > >>>>>>> are able to checkpoint/restart only single process. Recently Oren Laaden > >>>>>>> sent another kernel based implementation of checkpoint/restart. The main > >>>>>>> differences between this patchset and Oren's patchset are: > >>>>>> Hi Andrey, > >>>>>> > >>>>>> I'm curious what you want to happen with this patch set. Is there > >>>>>> something specific in Oren's set that deficient which you need > >>>>>> implemented? Are there some technical reasons you prefer this code? > >>>>> To be fair, and since (IIRC) the initial intent was to start with OpenVZ's > >>>>> approach, shouldn't Oren answer the same questions with respect to Andrey's > >>>>> patchset? > >>>>> > >>>>> I'm afraid that we are forgetting to take the best from both approaches... > >>>> I agree with Louis. > >>>> > >>>> I played with Oren's patchset and tryed to port it on x86_64. I was able > >>>> to sys_checkpoint/sys_restart but if you remove the restoring of the > >>>> general registers, the restart still works. I am not an expert on asm, > >>>> but my hypothesis is when we call sys_checkpoint the registers are saved > >>>> on the stack by the syscall and when we restore the memory of the > >>>> process, we restore the stack and the stacked registers are restored > >>>> when exiting the sys_restart. That make me feel there is an important > >>>> gap between external checkpoint and internal checkpoint. > >>> This is a misconception: my patches are not "internal checkpoint". My > >>> patches are basically "external checkpoint" by design, which *also* > >>> accommodates self-checkpointing (aka internal). The same holds for the > >>> restart. The implementation is demonstrated with "self-checkpoint" to > >>> avoid complicating things at this early stage of proof-of-concept. > >> Yep, I read your patchset :) > >> > >> I just want to clarify what we want to demonstrate with this patchset > >> for the proof-of-concept ? A self CR does not show what are the > >> complicate parts of the CR, we are just showing we can dump the memory > >> from the kernel and do setcontext/getcontext. > >> > >> We state at the container mini-summit on an approach: > >> > >> 1. Pre-dump > >> 2. Freeze the container > >> 3. Dump > >> 4. Thaw/Kill the container > >> 5. Post-dump > >> > >> We already have the freezer, and we can forget for now pre-dump and > >> post-dump. > >> > >> IMHO, for the proof-of-concept we should do a minimal CR (like you did), > >> but conforming with these 5 points, but that means we have to do an > >> external checkpoint. > > > > Right, Oren, iiuc you are insisting that 'external checkpoint' and > > 'multiple task checkpoint' are the same thing. But they aren't. > > Rather, I think that what we say is 'multiple tasks c/r' is what you say > > should be done from user-space :) > > Then I don't explain myself clearly :) > > The only thing I consider doing in user space is the creation of > the container, the namespaces and the processes. That I understand. > I argue that "external checkpoint of a single process" is very few > lines of code away from "self checkpoint" that is in v7. > > I'm not sure how you define "external restart" ? eventually, the If I ever said external restart, I actually meant external checkpoint. I understand that a task should call sys_restart() itself. > processes restart themselves. It is a question of how the processes > are created to begin with. > > > > > So particularly given that your patchset seems to be in good shape, > > I'd like to see external checkpoint explicitly supported. Please > > just call me a dunce if v7 already works for that. > > > > It seems like you want a single process to checkpoint a single (other) > process, and then a single process to start a single (other) process. Yup. > I tried to explicitly avoid dealing with the container (user space ? > kernel space ?) and with creating new processes (user space ? kernel > space ?). And that's the right thing to do. But: > Nevertheless, it's the _same_ code. All that is needed is to make the I was under the impression that sys_checkpoint() on some other task's pid and then restarting with that image would fail right now. > checkpoint syscall refer to the other task instead of self, and the > restart should create a container and fork there, then call sys_restart. > > I guess instead of repeating this argument over, I will go ahead and > post a patch on top of v7 to demonstrate this (without a container, Cool, thanks! > therefore without preserving the original pid). Yes, as i believe you said in another email earlier today, we have not decided about how to restore the pid. Eric continues to argue for playing games with /proc/sys/kernel/pid_max. -serge ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-21 0:18 ` Oren Laadan 2008-10-21 0:58 ` Serge E. Hallyn @ 2008-10-21 13:24 ` Daniel Lezcano 1 sibling, 0 replies; 69+ messages in thread From: Daniel Lezcano @ 2008-10-21 13:24 UTC (permalink / raw) To: Oren Laadan Cc: Serge E. Hallyn, Louis.Rilling, containers, Dave Hansen, linux-kernel, Andrey Mirkin Oren Laadan wrote: > > Serge E. Hallyn wrote: >> Quoting Daniel Lezcano (dlezcano@fr.ibm.com): >>> Oren Laadan wrote: >>>> Daniel Lezcano wrote: >>>>> Louis Rilling wrote: >>>>>> On Fri, Oct 17, 2008 at 04:33:03PM -0700, Dave Hansen wrote: >>>>>>> On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: >>>>>>>> This patchset introduces kernel based checkpointing/restart as it is >>>>>>>> implemented in OpenVZ project. This patchset has limited functionality and >>>>>>>> are able to checkpoint/restart only single process. Recently Oren Laaden >>>>>>>> sent another kernel based implementation of checkpoint/restart. The main >>>>>>>> differences between this patchset and Oren's patchset are: >>>>>>> Hi Andrey, >>>>>>> >>>>>>> I'm curious what you want to happen with this patch set. Is there >>>>>>> something specific in Oren's set that deficient which you need >>>>>>> implemented? Are there some technical reasons you prefer this code? >>>>>> To be fair, and since (IIRC) the initial intent was to start with OpenVZ's >>>>>> approach, shouldn't Oren answer the same questions with respect to Andrey's >>>>>> patchset? >>>>>> >>>>>> I'm afraid that we are forgetting to take the best from both approaches... >>>>> I agree with Louis. >>>>> >>>>> I played with Oren's patchset and tryed to port it on x86_64. I was able >>>>> to sys_checkpoint/sys_restart but if you remove the restoring of the >>>>> general registers, the restart still works. I am not an expert on asm, >>>>> but my hypothesis is when we call sys_checkpoint the registers are saved >>>>> on the stack by the syscall and when we restore the memory of the >>>>> process, we restore the stack and the stacked registers are restored >>>>> when exiting the sys_restart. That make me feel there is an important >>>>> gap between external checkpoint and internal checkpoint. >>>> This is a misconception: my patches are not "internal checkpoint". My >>>> patches are basically "external checkpoint" by design, which *also* >>>> accommodates self-checkpointing (aka internal). The same holds for the >>>> restart. The implementation is demonstrated with "self-checkpoint" to >>>> avoid complicating things at this early stage of proof-of-concept. >>> Yep, I read your patchset :) >>> >>> I just want to clarify what we want to demonstrate with this patchset >>> for the proof-of-concept ? A self CR does not show what are the >>> complicate parts of the CR, we are just showing we can dump the memory >>> from the kernel and do setcontext/getcontext. >>> >>> We state at the container mini-summit on an approach: >>> >>> 1. Pre-dump >>> 2. Freeze the container >>> 3. Dump >>> 4. Thaw/Kill the container >>> 5. Post-dump >>> >>> We already have the freezer, and we can forget for now pre-dump and >>> post-dump. >>> >>> IMHO, for the proof-of-concept we should do a minimal CR (like you did), >>> but conforming with these 5 points, but that means we have to do an >>> external checkpoint. >> Right, Oren, iiuc you are insisting that 'external checkpoint' and >> 'multiple task checkpoint' are the same thing. But they aren't. >> Rather, I think that what we say is 'multiple tasks c/r' is what you say >> should be done from user-space :) > > Then I don't explain myself clearly :) > > The only thing I consider doing in user space is the creation of > the container, the namespaces and the processes. > > I argue that "external checkpoint of a single process" is very few > lines of code away from "self checkpoint" that is in v7. > > I'm not sure how you define "external restart" ? eventually, the > processes restart themselves. It is a question of how the processes > are created to begin with. > >> So particularly given that your patchset seems to be in good shape, >> I'd like to see external checkpoint explicitly supported. Please >> just call me a dunce if v7 already works for that. >> > > It seems like you want a single process to checkpoint a single (other) > process, and then a single process to start a single (other) process. > > I tried to explicitly avoid dealing with the container (user space ? > kernel space ?) and with creating new processes (user space ? kernel > space ?). > > Nevertheless, it's the _same_ code. All that is needed is to make the > checkpoint syscall refer to the other task instead of self, and the > restart should create a container and fork there, then call sys_restart. > > I guess instead of repeating this argument over, I will go ahead and > post a patch on top of v7 to demonstrate this (without a container, > therefore without preserving the original pid). Cedric made a patch for the external checkpoint: http://lxc.sourceforge.net/patches/2.6.27/2.6.27-rc8-lxc1/0035-enable-external-checkpoint.patch The main difference is you will need to freeze the process because it will not block itself via a syscall (there is the freezer patchset). For the restart, perhaps you can just do a process calling sys_restart and so we delay the fork from user/kernel discussion, no ? ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 16:37 ` Daniel Lezcano 2008-10-20 17:23 ` Serge E. Hallyn @ 2008-10-27 14:45 ` Andrey Mirkin 1 sibling, 0 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-10-27 14:45 UTC (permalink / raw) To: devel Cc: Daniel Lezcano, Oren Laadan, Louis.Rilling, containers, linux-kernel, Dave Hansen On Monday 20 October 2008 20:37 Daniel Lezcano wrote: > Oren Laadan wrote: > > Daniel Lezcano wrote: > >> Louis Rilling wrote: > >>> On Fri, Oct 17, 2008 at 04:33:03PM -0700, Dave Hansen wrote: > >>>> On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: > >>>>> This patchset introduces kernel based checkpointing/restart as it is > >>>>> implemented in OpenVZ project. This patchset has limited > >>>>> functionality and are able to checkpoint/restart only single process. > >>>>> Recently Oren Laaden sent another kernel based implementation of > >>>>> checkpoint/restart. The main differences between this patchset and > >>>>> Oren's patchset are: > >>>> > >>>> Hi Andrey, > >>>> > >>>> I'm curious what you want to happen with this patch set. Is there > >>>> something specific in Oren's set that deficient which you need > >>>> implemented? Are there some technical reasons you prefer this code? > >>> > >>> To be fair, and since (IIRC) the initial intent was to start with > >>> OpenVZ's approach, shouldn't Oren answer the same questions with > >>> respect to Andrey's patchset? > >>> > >>> I'm afraid that we are forgetting to take the best from both > >>> approaches... > >> > >> I agree with Louis. > >> > >> I played with Oren's patchset and tryed to port it on x86_64. I was able > >> to sys_checkpoint/sys_restart but if you remove the restoring of the > >> general registers, the restart still works. I am not an expert on asm, > >> but my hypothesis is when we call sys_checkpoint the registers are saved > >> on the stack by the syscall and when we restore the memory of the > >> process, we restore the stack and the stacked registers are restored > >> when exiting the sys_restart. That make me feel there is an important > >> gap between external checkpoint and internal checkpoint. > > > > This is a misconception: my patches are not "internal checkpoint". My > > patches are basically "external checkpoint" by design, which *also* > > accommodates self-checkpointing (aka internal). The same holds for the > > restart. The implementation is demonstrated with "self-checkpoint" to > > avoid complicating things at this early stage of proof-of-concept. > > Yep, I read your patchset :) > > I just want to clarify what we want to demonstrate with this patchset > for the proof-of-concept ? A self CR does not show what are the > complicate parts of the CR, we are just showing we can dump the memory > from the kernel and do setcontext/getcontext. > > We state at the container mini-summit on an approach: > > 1. Pre-dump > 2. Freeze the container > 3. Dump > 4. Thaw/Kill the container > 5. Post-dump > > We already have the freezer, and we can forget for now pre-dump and > post-dump. > > IMHO, for the proof-of-concept we should do a minimal CR (like you did), > but conforming with these 5 points, but that means we have to do an > external checkpoint. > > If the POC conforms with that, the patchset will be a little different > and that will show what are the difficult part for restarting a process, > especially to restart it at the frozen state :) and that will give an > idea from 10000 feets of the big picture. > > > For multiple processes all that is needed is a container and a loop > > on the checkpoint side, and a method to recreate processes on the > > restart side. Andrew suggests to do it in kernel space, I still have > > doubts. > > A question to Andrey, do you, in OpenVZ, restart "externally" or it is > the first process of the pid namespace which calls sys_restart and then > populates the pid namespace ? In OpenVZ we are creating first task and namespaces from sys_restart. Andrey > > > While I held out the multi-process part of the patch so far because I > > was explicitly asked to do it, it seems like this would be a good time > > to push it out and get feedback. > > IMHO it is too soon... > > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/containers > > _______________________________________________ > Devel mailing list > Devel@openvz.org > https://openvz.org/mailman/listinfo/devel ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 15:53 ` Oren Laadan 2008-10-20 16:37 ` Daniel Lezcano @ 2008-10-20 16:51 ` Serge E. Hallyn 2008-10-21 9:36 ` Cedric Le Goater 2 siblings, 0 replies; 69+ messages in thread From: Serge E. Hallyn @ 2008-10-20 16:51 UTC (permalink / raw) To: Oren Laadan Cc: Daniel Lezcano, Louis.Rilling, containers, Dave Hansen, linux-kernel, Andrey Mirkin Quoting Oren Laadan (orenl@cs.columbia.edu): > This is a misconception: my patches are not "internal checkpoint". My > patches are basically "external checkpoint" by design, which *also* > accommodates self-checkpointing (aka internal). The same holds for the > restart. The implementation is demonstrated with "self-checkpoint" to > avoid complicating things at this early stage of proof-of-concept. > > For multiple processes all that is needed is a container and a loop > on the checkpoint side, and a method to recreate processes on the > restart side. Andrew suggests to do it in kernel space, I still have > doubts. Yes I still prefer in-kernel. Can you elaborate on advantages of doing more work in userspace? > While I held out the multi-process part of the patch so far because I Yup, and i appreciate your restraint until now :) It made your patchset much easier to review. > was explicitly asked to do it, it seems like this would be a good time > to push it out and get feedback. Can you send that out as a patch(set) on top of your v7? I'd love to see (and test) it. thanks, -serge ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 15:53 ` Oren Laadan 2008-10-20 16:37 ` Daniel Lezcano 2008-10-20 16:51 ` Serge E. Hallyn @ 2008-10-21 9:36 ` Cedric Le Goater 2 siblings, 0 replies; 69+ messages in thread From: Cedric Le Goater @ 2008-10-21 9:36 UTC (permalink / raw) To: Oren Laadan Cc: Daniel Lezcano, Louis.Rilling, containers, Dave Hansen, linux-kernel, Andrey Mirkin >> IMHO we should look at Dmitry patchset and merge the external checkpoint >> code to Oren's patchset in order to checkpoint *one* process and have >> the process to restart itself. At this point, we can begin to talk about >> the restart itself, shall we have the kernel to fork the processes to be >> restarted ? shall we fork from userspace and implement some mechanism to >> have each processes to restart themselves ? etc... >> > > In both approaches, processes restart themselves, in the sense that a > process to be restarted eventually calls "do_restart()" (or equivalent). > > The only question is how processes are created. Andrew's patch creates > everything inside the kernel. I would like to still give it a try outside > the kernel. Everything is ready, except that we need a way to pre-select > a PID for the new child... we never agreed on that one, did we ? what do you mean ? like a clone_with_pid() routine ? > If we go ahead with the kernel-based process creation, it's easy to merge > it to the current patch-set. Both solution are valid. Nevertheless, I would choose the solution sharing existing code and being arch independent. Now, we can start by duplicating code and see later how we could share. But for mainline inclusion, I think that code reuse is a faster path. C. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 11:10 ` Louis Rilling 2008-10-20 13:25 ` Daniel Lezcano @ 2008-10-20 16:36 ` Dave Hansen 1 sibling, 0 replies; 69+ messages in thread From: Dave Hansen @ 2008-10-20 16:36 UTC (permalink / raw) To: Louis.Rilling; +Cc: containers, Andrey Mirkin, linux-kernel On Mon, 2008-10-20 at 13:10 +0200, Louis Rilling wrote: > To be fair, and since (IIRC) the initial intent was to start with OpenVZ's > approach, shouldn't Oren answer the same questions with respect to Andrey's > patchset? I'm only really "supporting" Oren's patch set because he got it out way before the OpenVZ one showed up, and has kept integrating improvements into it to keep me interested. The OpenVZ approach does a few things that Oren's does not, and it *is* a bit more mature. I'm quite willing to jump camps if there's something compelling in what Andrey has posted, even though I've put quite a bit of time in to reviewing and improving Oren's. I'm looking for that "something". :) -- Dave ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-17 23:33 ` Dave Hansen 2008-10-20 11:10 ` Louis Rilling @ 2008-10-20 12:14 ` Andrey Mirkin 2008-10-20 15:55 ` Dave Hansen 2008-10-20 17:17 ` Oren Laadan 1 sibling, 2 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-10-20 12:14 UTC (permalink / raw) To: devel; +Cc: Dave Hansen, Pavel Emelyanov, containers, linux-kernel On Saturday 18 October 2008 03:33 Dave Hansen wrote: > On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: > > This patchset introduces kernel based checkpointing/restart as it is > > implemented in OpenVZ project. This patchset has limited functionality > > and are able to checkpoint/restart only single process. Recently Oren > > Laaden sent another kernel based implementation of checkpoint/restart. > > The main differences between this patchset and Oren's patchset are: > > Hi Andrey, > > I'm curious what you want to happen with this patch set. Is there > something specific in Oren's set that deficient which you need > implemented? Are there some technical reasons you prefer this code? Hi Dave, Right now my patchset (v2) provides an ability to checkpoint and restart a group of processes. The process of checkpointing and restart can be initiated from external process (not from the process which should be checkpointed). Also I think that all the restart job (including process forking) should be done in kernel, as in this case we will not depend on user space and will be more secure. This is also implemented in my patchset. Andrey ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 12:14 ` [Devel] " Andrey Mirkin @ 2008-10-20 15:55 ` Dave Hansen 2008-10-27 14:07 ` Andrey Mirkin 2008-10-20 17:17 ` Oren Laadan 1 sibling, 1 reply; 69+ messages in thread From: Dave Hansen @ 2008-10-20 15:55 UTC (permalink / raw) To: Andrey Mirkin; +Cc: devel, containers, Pavel Emelyanov, linux-kernel On Mon, 2008-10-20 at 16:14 +0400, Andrey Mirkin wrote: > Right now my patchset (v2) provides an ability to checkpoint and restart a > group of processes. The process of checkpointing and restart can be initiated > from external process (not from the process which should be checkpointed). Absolutely. Oren's code does it this way to make for a smaller patch at first. The syscall takes a pid argument so it is surely expected to be expanded upon later. > Also I think that all the restart job (including process forking) should be > done in kernel, as in this case we will not depend on user space and will be > more secure. This is also implemented in my patchset. Do you think that this is an approach that Oren's patches are married to, or is this a "feature" we can add on later? I don't care which patch set we end up sticking in the kernel. I'm trying to figure out which code we can more easily build upon in the future. The fact that Oren's or yours can't do certain little things right now does not bother me. Honestly, I'm a little more confident that everyone can work with Oren since he managed to get 7 revisions of his patch out and make some pretty large changes while in the same time the OpenVZ patch was only released twice. I'm not sure what has changed in the OpenVZ patch between releases, either. Are there any reasons that you absolutely can not use the code Oren posted? Will it not fulfill your needs somehow? If so, could you please elaborate on how? -- Dave ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 15:55 ` Dave Hansen @ 2008-10-27 14:07 ` Andrey Mirkin 2008-10-27 14:39 ` Oren Laadan 2008-11-03 19:35 ` Oren Laadan 0 siblings, 2 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-10-27 14:07 UTC (permalink / raw) To: Dave Hansen; +Cc: devel, containers, linux-kernel On Monday 20 October 2008 19:55 Dave Hansen wrote: > On Mon, 2008-10-20 at 16:14 +0400, Andrey Mirkin wrote: > > Right now my patchset (v2) provides an ability to checkpoint and restart > > a group of processes. The process of checkpointing and restart can be > > initiated from external process (not from the process which should be > > checkpointed). > > Absolutely. Oren's code does it this way to make for a smaller patch at > first. The syscall takes a pid argument so it is surely expected to be > expanded upon later. > > > Also I think that all the restart job (including process forking) should > > be done in kernel, as in this case we will not depend on user space and > > will be more secure. This is also implemented in my patchset. > > Do you think that this is an approach that Oren's patches are married > to, or is this a "feature" we can add on later? Well, AFAICS from Oren's patch set his approach is oriented on process creation in user space. I think we should choose right now what approach will be used for process creation. We have two options here: fork processes in kernel or fork them in user space. If process will be forked in user space, then there will be a gap when process will be in user space and can be killed with received signal before entering kernel. Also we will need a functionolity to create processes with predefined PID. I think it is not very good to provide such ability to user space. That is why we prefer in OpenVZ to do all the job in kernel. > I don't care which patch set we end up sticking in the kernel. I'm > trying to figure out which code we can more easily build upon in the > future. The fact that Oren's or yours can't do certain little things > right now does not bother me. > > Honestly, I'm a little more confident that everyone can work with Oren > since he managed to get 7 revisions of his patch out and make some > pretty large changes while in the same time the OpenVZ patch was only > released twice. I'm not sure what has changed in the OpenVZ patch > between releases, either. That is my fault. I am working right now on my Ph.D, that is why my activity is not very high. But now I hope I will have more time for that. > Are there any reasons that you absolutely can not use the code Oren > posted? Will it not fulfill your needs somehow? If so, could you > please elaborate on how? We have one major difference with Oren's code - how processes are created during restr. Right now I'm trying to port kernel process creation on top of Oren's patches. I agree that working in collaboration will speed up merging of checkpointing to mainstream. Andrey P.S.: Sorry for late reply, my mailer attached your e-mail to wrong thread. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-27 14:07 ` Andrey Mirkin @ 2008-10-27 14:39 ` Oren Laadan 2008-10-30 6:02 ` Andrey Mirkin 2008-11-03 19:35 ` Oren Laadan 1 sibling, 1 reply; 69+ messages in thread From: Oren Laadan @ 2008-10-27 14:39 UTC (permalink / raw) To: Andrey Mirkin; +Cc: Dave Hansen, containers, linux-kernel Andrey Mirkin wrote: > On Monday 20 October 2008 19:55 Dave Hansen wrote: >> On Mon, 2008-10-20 at 16:14 +0400, Andrey Mirkin wrote: >>> Right now my patchset (v2) provides an ability to checkpoint and restart >>> a group of processes. The process of checkpointing and restart can be >>> initiated from external process (not from the process which should be >>> checkpointed). >> Absolutely. Oren's code does it this way to make for a smaller patch at >> first. The syscall takes a pid argument so it is surely expected to be >> expanded upon later. >> >>> Also I think that all the restart job (including process forking) should >>> be done in kernel, as in this case we will not depend on user space and >>> will be more secure. This is also implemented in my patchset. >> Do you think that this is an approach that Oren's patches are married >> to, or is this a "feature" we can add on later? > > Well, AFAICS from Oren's patch set his approach is oriented on process > creation in user space. I think we should choose right now what approach will > be used for process creation. This is inaccurate. I intentionally did not address how processes will be created, by simply allowing either way to be added to the patch. I do agree that we probably want to decide how to do it. However, there is also room to allow for both approaches, in a compatible way, should we wish to explore both. > We have two options here: fork processes in kernel or fork them in user space. > If process will be forked in user space, then there will be a gap when process > will be in user space and can be killed with received signal before entering Why do we care about it ? Why is there a difference if it is killed before or after entering the kernel (e.g. user aborted restart, or kernel OOM kicked in) ? > kernel. Also we will need a functionolity to create processes with predefined > PID. I think it is not very good to provide such ability to user space. That > is why we prefer in OpenVZ to do all the job in kernel. This is the weak side of creating the processes in user space - that we need such an interface. Note, however, that we can easily "hide" it inside the interface of the sys_restart() call, and restrict how it may be used. Oren. > >> I don't care which patch set we end up sticking in the kernel. I'm >> trying to figure out which code we can more easily build upon in the >> future. The fact that Oren's or yours can't do certain little things >> right now does not bother me. >> >> Honestly, I'm a little more confident that everyone can work with Oren >> since he managed to get 7 revisions of his patch out and make some >> pretty large changes while in the same time the OpenVZ patch was only >> released twice. I'm not sure what has changed in the OpenVZ patch >> between releases, either. > > That is my fault. I am working right now on my Ph.D, that is why my activity > is not very high. But now I hope I will have more time for that. > >> Are there any reasons that you absolutely can not use the code Oren >> posted? Will it not fulfill your needs somehow? If so, could you >> please elaborate on how? > > We have one major difference with Oren's code - how processes are created > during restr. > Right now I'm trying to port kernel process creation on top of Oren's patches. > I agree that working in collaboration will speed up merging of checkpointing > to mainstream. > > Andrey > > P.S.: Sorry for late reply, my mailer attached your e-mail to wrong thread. > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linux-foundation.org/mailman/listinfo/containers > ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-27 14:39 ` Oren Laadan @ 2008-10-30 6:02 ` Andrey Mirkin 2008-10-30 11:47 ` Louis Rilling ` (2 more replies) 0 siblings, 3 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-10-30 6:02 UTC (permalink / raw) To: Oren Laadan, Dave Hansen, Serge E. Hallyn, Cedric Le Goater, Daniel Lezcano, Louis Rilling Cc: containers, linux-kernel On Monday 27 October 2008 17:39 Oren Laadan wrote: > Andrey Mirkin wrote: > > On Monday 20 October 2008 19:55 Dave Hansen wrote: > >> On Mon, 2008-10-20 at 16:14 +0400, Andrey Mirkin wrote: > >>> Right now my patchset (v2) provides an ability to checkpoint and > >>> restart a group of processes. The process of checkpointing and restart > >>> can be initiated from external process (not from the process which > >>> should be checkpointed). > >> > >> Absolutely. Oren's code does it this way to make for a smaller patch at > >> first. The syscall takes a pid argument so it is surely expected to be > >> expanded upon later. > >> > >>> Also I think that all the restart job (including process forking) > >>> should be done in kernel, as in this case we will not depend on user > >>> space and will be more secure. This is also implemented in my patchset. > >> > >> Do you think that this is an approach that Oren's patches are married > >> to, or is this a "feature" we can add on later? > > > > Well, AFAICS from Oren's patch set his approach is oriented on process > > creation in user space. I think we should choose right now what approach > > will be used for process creation. > > This is inaccurate. > > I intentionally did not address how processes will be created, by > simply allowing either way to be added to the patch. Yes, you right. Either way is possible with your patchset. But as I understand in ZAP you are using user space process creation. No? That is why I think that your design is more convenient for user process creation. > I do agree that we probably want to decide how to do it. However, > there is also room to allow for both approaches, in a compatible > way, should we wish to explore both. Yes, we can implement both approaches. Do you think we really need this? > > We have two options here: fork processes in kernel or fork them in user > > space. If process will be forked in user space, then there will be a gap > > when process will be in user space and can be killed with received signal > > before entering > > Why do we care about it ? > Why is there a difference if it is killed before or after entering > the kernel (e.g. user aborted restart, or kernel OOM kicked in) ? If one process is killed during restart then you can even do not notice that (if processes are created from user space and then call sys_restart). And you will get not the same state as before C/R. > > kernel. Also we will need a functionolity to create processes with > > predefined PID. I think it is not very good to provide such ability to > > user space. That is why we prefer in OpenVZ to do all the job in kernel. > > This is the weak side of creating the processes in user space - > that we need such an interface. Note, however, that we can > easily "hide" it inside the interface of the sys_restart() call, > and restrict how it may be used. Of course we can "hide" it somehow, but anyway we will have a hole and that is not good. Anyway we should ask everyone what they think about user- and kernel- based process creation. Dave, Serge, Cedric, Daniel, Louis what do you think about that? Andrey > >> I don't care which patch set we end up sticking in the kernel. I'm > >> trying to figure out which code we can more easily build upon in the > >> future. The fact that Oren's or yours can't do certain little things > >> right now does not bother me. > >> > >> Honestly, I'm a little more confident that everyone can work with Oren > >> since he managed to get 7 revisions of his patch out and make some > >> pretty large changes while in the same time the OpenVZ patch was only > >> released twice. I'm not sure what has changed in the OpenVZ patch > >> between releases, either. > > > > That is my fault. I am working right now on my Ph.D, that is why my > > activity is not very high. But now I hope I will have more time for that. > > > >> Are there any reasons that you absolutely can not use the code Oren > >> posted? Will it not fulfill your needs somehow? If so, could you > >> please elaborate on how? > > > > We have one major difference with Oren's code - how processes are created > > during restr. > > Right now I'm trying to port kernel process creation on top of Oren's > > patches. I agree that working in collaboration will speed up merging of > > checkpointing to mainstream. > > > > Andrey > > > > P.S.: Sorry for late reply, my mailer attached your e-mail to wrong > > thread. _______________________________________________ > > Containers mailing list > > Containers@lists.linux-foundation.org > > https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-30 6:02 ` Andrey Mirkin @ 2008-10-30 11:47 ` Louis Rilling 2008-10-30 17:08 ` Dave Hansen 2008-10-30 17:45 ` Oren Laadan 2008-10-30 14:08 ` Serge E. Hallyn 2008-10-30 17:03 ` Dave Hansen 2 siblings, 2 replies; 69+ messages in thread From: Louis Rilling @ 2008-10-30 11:47 UTC (permalink / raw) To: Andrey Mirkin Cc: Oren Laadan, Dave Hansen, Serge E. Hallyn, Cedric Le Goater, Daniel Lezcano, containers, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2328 bytes --] On Thu, Oct 30, 2008 at 10:02:44AM +0400, Andrey Mirkin wrote: > > > kernel. Also we will need a functionolity to create processes with > > > predefined PID. I think it is not very good to provide such ability to > > > user space. That is why we prefer in OpenVZ to do all the job in kernel. > > > > This is the weak side of creating the processes in user space - > > that we need such an interface. Note, however, that we can > > easily "hide" it inside the interface of the sys_restart() call, > > and restrict how it may be used. > > Of course we can "hide" it somehow, but anyway we will have a hole and that is > not good. > > Anyway we should ask everyone what they think about user- and kernel- based > process creation. > Dave, Serge, Cedric, Daniel, Louis what do you think about that? Frankly, I'm not convinced (yet) that one approach is better than the other one. I only *tend* to prefer kernel-based, for the reasons explained below. I know that there are arguments in favor of userspace (I've at least seen security-related ones), but I let their authors detail them (again). In Kerrighed this is kernel-based, and will remain kernel-based because we checkpoint a distributed task tree, and want to restart it as mush as possible with the same distribution. The distributed protocol used for restart is currently too fragile and complex to rely on customized user-space implementations. That said, if someone brings very good arguments in favor of userspace implementations, we might consider changing this. Without taking distributed restart into account, I also tend to prefer kernel-based, mainly for two (not so strong) reasons: 1) this prevents userspace from doing weird things, like changing the task tree and let the kernel detect it and deal with the mess this creates (think about two threads being restarted in separate processes that do not even share their parents). But one can argue that userspace can change the checkpoint image as well, so that the kernel must check for such weird things anyway. 2) restart will be more efficient with respect to shared objects. Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-30 11:47 ` Louis Rilling @ 2008-10-30 17:08 ` Dave Hansen 2008-10-30 18:01 ` Louis Rilling 2008-10-30 17:45 ` Oren Laadan 1 sibling, 1 reply; 69+ messages in thread From: Dave Hansen @ 2008-10-30 17:08 UTC (permalink / raw) To: Louis.Rilling Cc: Andrey Mirkin, containers, linux-kernel, Cedric Le Goater, Daniel Lezcano On Thu, 2008-10-30 at 12:47 +0100, Louis Rilling wrote: > 1) this prevents userspace from doing weird things, like changing the task tree > and let the kernel detect it and deal with the mess this creates (think about > two threads being restarted in separate processes that do not even share their > parents). But one can argue that userspace can change the checkpoint image as > well, so that the kernel must check for such weird things anyway. To me, this is one of the strongest arguments out there for doing restart as much as possible with existing user<->kernel APIs. Having the kernel detect and clean up userspace's messes is not going to work. We might as well just do things in the kernel rather than do that. What we *should* do is leverage all of the existing APIs that we already have instead of creating completely new code paths into which my butter fingers can introduce new kernel bugs. > 2) restart will be more efficient with respect to shared objects. Can you quantify this? Which objects? How much more efficient? -- Dave ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-30 17:08 ` Dave Hansen @ 2008-10-30 18:01 ` Louis Rilling 2008-10-30 18:28 ` Oren Laadan 0 siblings, 1 reply; 69+ messages in thread From: Louis Rilling @ 2008-10-30 18:01 UTC (permalink / raw) To: Dave Hansen Cc: Daniel Lezcano, containers, Cedric Le Goater, Andrey Mirkin, linux-kernel [-- Attachment #1: Type: text/plain, Size: 3039 bytes --] On Thu, Oct 30, 2008 at 10:08:44AM -0700, Dave Hansen wrote: > On Thu, 2008-10-30 at 12:47 +0100, Louis Rilling wrote: > > 1) this prevents userspace from doing weird things, like changing the task tree > > and let the kernel detect it and deal with the mess this creates (think about > > two threads being restarted in separate processes that do not even share their > > parents). But one can argue that userspace can change the checkpoint image as > > well, so that the kernel must check for such weird things anyway. > > To me, this is one of the strongest arguments out there for doing > restart as much as possible with existing user<->kernel APIs. Having > the kernel detect and clean up userspace's messes is not going to work. > We might as well just do things in the kernel rather than do that. > > What we *should* do is leverage all of the existing APIs that we already > have instead of creating completely new code paths into which my butter > fingers can introduce new kernel bugs. > > > 2) restart will be more efficient with respect to shared objects. > > Can you quantify this? Which objects? How much more efficient? Quantify? No. I expect that investigating both approaches will show us numbers. Unless Oren already has some? Which objects? I think that two kinds will especially matter: objects usually shared only inside a thread group (mm_struct, fs_struct, files_struct, signal_struct and sighand_struct), and individual file descriptors. The point is to avoid creating new structures before destroying them because the restarted task shares them with a previously restarted one. Concerning individual file descriptors, limiting the number of open files before calling sys_restart() may avoid these useless creations/destructions (actually the "useless" work mainly consists in managing ref counts since file descriptors are shared after fork()). Concerning thread-shared structures, it is probably easy for userspace to guess which clone flags to use when restarting threads, but 1) kernel-space will have to check that the sharing is correct anyway, and 2) kernel-space will have to fix it anyway if structures are not shared in an obvious manner between tasks (think about A creating B with shared files_struct, B creating C with shared files_struct, B unsharing its files_struct, and then checkpoint). So, with a userspace implementation, useless structures will be created anyway, and optimizing the common cases (regular threads) just duplicates kernel's work of checking which shared structure to use for each task to restart. With a kernel-space implementation, all useless creations can be avoided, and no duplicate work is needed. That said, numbers may show us that useless creations are not so time-consuming, but we won't know before seeing them... Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-30 18:01 ` Louis Rilling @ 2008-10-30 18:28 ` Oren Laadan 0 siblings, 0 replies; 69+ messages in thread From: Oren Laadan @ 2008-10-30 18:28 UTC (permalink / raw) To: Louis.Rilling Cc: Dave Hansen, linux-kernel, Daniel Lezcano, Cedric Le Goater, containers, Andrey Mirkin Louis Rilling wrote: > On Thu, Oct 30, 2008 at 10:08:44AM -0700, Dave Hansen wrote: >> On Thu, 2008-10-30 at 12:47 +0100, Louis Rilling wrote: >>> 1) this prevents userspace from doing weird things, like changing the task tree >>> and let the kernel detect it and deal with the mess this creates (think about >>> two threads being restarted in separate processes that do not even share their >>> parents). But one can argue that userspace can change the checkpoint image as >>> well, so that the kernel must check for such weird things anyway. >> To me, this is one of the strongest arguments out there for doing >> restart as much as possible with existing user<->kernel APIs. Having >> the kernel detect and clean up userspace's messes is not going to work. >> We might as well just do things in the kernel rather than do that. >> >> What we *should* do is leverage all of the existing APIs that we already >> have instead of creating completely new code paths into which my butter >> fingers can introduce new kernel bugs. >> >>> 2) restart will be more efficient with respect to shared objects. >> Can you quantify this? Which objects? How much more efficient? > > Quantify? No. I expect that investigating both approaches will show us numbers. > Unless Oren already has some? I do have some. it's pretty quick :) see the usenix 2007 paper... the new implementation will be faster, though. > > Which objects? I think that two kinds will especially matter: objects usually > shared only inside a thread group (mm_struct, fs_struct, files_struct, > signal_struct and sighand_struct), and individual file descriptors. The point is > to avoid creating new structures before destroying them because the restarted > task shares them with a previously restarted one. all the forks in the user space will be done with CLONE_VM etc, to avoid exactly that sort of overhead. in any event, my experience is that this is not the dominant factor in the restart time. > > Concerning individual file descriptors, limiting the number of open files before > calling sys_restart() may avoid these useless creations/destructions (actually > the "useless" work mainly consists in managing ref counts since file descriptors > are shared after fork()). > > Concerning thread-shared structures, it is probably easy for userspace to guess > which clone flags to use when restarting threads, but > 1) kernel-space will have to check that the sharing is correct anyway, and ok. that's not a lot of work :p (see more below) > 2) kernel-space will have to fix it anyway if structures are not shared in an > obvious manner between tasks (think about A creating B with shared files_struct, > B creating C with shared files_struct, B unsharing its files_struct, and then > checkpoint). > > So, with a userspace implementation, useless structures will be created anyway, > and optimizing the common cases (regular threads) just duplicates kernel's work > of checking which shared structure to use for each task to restart. > With a kernel-space implementation, all useless creations can be avoided, and no > duplicate work is needed. they can also be avoided in user space - you "optimistically" create everything shared to begin with, and in the kernel (inside sys_restart) you "unshare" and create the necessary resources on demand - just like you would do with kernel based process creation. in this case, the extra work is only ref-counting, and then sys_restart will unconditionally attach the right shared resource to the restarting process (the "right" shared resource will be found, of course, in the shared pool). this way, you don't even need to check what the user gave you - you simply ignore overwrite it. > > That said, numbers may show us that useless creations are not so > time-consuming, but we won't know before seeing them... yes, odds are that you are right. Oren ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-30 11:47 ` Louis Rilling 2008-10-30 17:08 ` Dave Hansen @ 2008-10-30 17:45 ` Oren Laadan 2008-10-30 18:14 ` Louis Rilling 1 sibling, 1 reply; 69+ messages in thread From: Oren Laadan @ 2008-10-30 17:45 UTC (permalink / raw) To: Louis.Rilling Cc: Andrey Mirkin, Dave Hansen, Serge E. Hallyn, Cedric Le Goater, Daniel Lezcano, containers, linux-kernel Louis Rilling wrote: > On Thu, Oct 30, 2008 at 10:02:44AM +0400, Andrey Mirkin wrote: >>>> kernel. Also we will need a functionolity to create processes with >>>> predefined PID. I think it is not very good to provide such ability to >>>> user space. That is why we prefer in OpenVZ to do all the job in kernel. >>> This is the weak side of creating the processes in user space - >>> that we need such an interface. Note, however, that we can >>> easily "hide" it inside the interface of the sys_restart() call, >>> and restrict how it may be used. >> Of course we can "hide" it somehow, but anyway we will have a hole and that is >> not good. >> >> Anyway we should ask everyone what they think about user- and kernel- based >> process creation. >> Dave, Serge, Cedric, Daniel, Louis what do you think about that? > > Frankly, I'm not convinced (yet) that one approach is better than the other one. > I only *tend* to prefer kernel-based, for the reasons explained below. I know > that there are arguments in favor of userspace (I've at least seen > security-related ones), but I let their authors detail them (again). I'm not convinced either. I think both implementation can eventually work well. > > In Kerrighed this is kernel-based, and will remain kernel-based because we > checkpoint a distributed task tree, and want to restart it as mush as possible > with the same distribution. The distributed protocol used for restart is > currently too fragile and complex to rely on customized user-space > implementations. That said, if someone brings very good arguments in favor of > userspace implementations, we might consider changing this. Zap also has distributed checkpoint which does not require strict kernel-side ordering. Do you need that because you do SSI ? > > Without taking distributed restart into account, I also tend to prefer > kernel-based, mainly for two (not so strong) reasons: > 1) this prevents userspace from doing weird things, like changing the task tree > and let the kernel detect it and deal with the mess this creates (think about > two threads being restarted in separate processes that do not even share their > parents). But one can argue that userspace can change the checkpoint image as > well, so that the kernel must check for such weird things anyway. I don't really buy this argument. First, as you say, user can change the checkpoint image file. Second, you can verify in the kernel that the real relationships of the processes match those specified (and expected from) the image file. That's pretty straightforward. > 2) restart will be more efficient with respect to shared objects. Can you elaborate on this ? In what sense "more efficient" ? Note that the topic in question is not whether to do the entire restart from user space (and I argue that most work should be done in the kernel), but rather whether process creation (and only that) should be done in kernel or user space. Quick thoughts of pros/cons of each approach are: user space: + re-use existing api (fork) + easier to debug + will allow 'handmade' resources restart: it was mentioned before that one may want to reattach stdout to a different place after restart; a user based restart of processes can make this much easier: e.g. the user process can create the alternative resources, give them to the kernel and only then call sys_restart) + arch-independent code - a bit slower than in kernel space - requires a clone-with-specific-pid syscall or interface kernel space: + a bit easier to control everything + a bit faster than user space + no need for user-visible interface for clone-with-... - arch-dependent code - needs special code to fight 'fork-bomb' So, I'm not convinced, and I even think there may be room to both, for the time being. I volunteer to support the user-space alternative while we make up our minds. Oren. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-30 17:45 ` Oren Laadan @ 2008-10-30 18:14 ` Louis Rilling 2008-10-30 18:32 ` Oren Laadan 0 siblings, 1 reply; 69+ messages in thread From: Louis Rilling @ 2008-10-30 18:14 UTC (permalink / raw) To: Oren Laadan Cc: Andrey Mirkin, Dave Hansen, Serge E. Hallyn, Cedric Le Goater, Daniel Lezcano, containers, linux-kernel [-- Attachment #1: Type: text/plain, Size: 3710 bytes --] On Thu, Oct 30, 2008 at 01:45:25PM -0400, Oren Laadan wrote: > > > Louis Rilling wrote: > > In Kerrighed this is kernel-based, and will remain kernel-based because we > > checkpoint a distributed task tree, and want to restart it as mush as possible > > with the same distribution. The distributed protocol used for restart is > > currently too fragile and complex to rely on customized user-space > > implementations. That said, if someone brings very good arguments in favor of > > userspace implementations, we might consider changing this. > > Zap also has distributed checkpoint which does not require strict > kernel-side ordering. Do you need that because you do SSI ? Yes. Tasks from different nodes have parent-children, session leader, etc. relationships, and the distributed management of struct pid lifecycle is a bit touchy too. By the way, splitting the checkpoint image in one file for each task helps us a lot to make restart parallel, because it is more efficient for the file system to handle parallel reads of different files from different nodes than parallel reads on a single file descriptor from different nodes. > > > > > Without taking distributed restart into account, I also tend to prefer > > kernel-based, mainly for two (not so strong) reasons: > > 1) this prevents userspace from doing weird things, like changing the task tree > > and let the kernel detect it and deal with the mess this creates (think about > > two threads being restarted in separate processes that do not even share their > > parents). But one can argue that userspace can change the checkpoint image as > > well, so that the kernel must check for such weird things anyway. > > I don't really buy this argument. First, as you say, user can change > the checkpoint image file. Second, you can verify in the kernel that > the real relationships of the processes match those specified (and > expected from) the image file. That's pretty straightforward. > > > 2) restart will be more efficient with respect to shared objects. > > Can you elaborate on this ? In what sense "more efficient" ? > > Note that the topic in question is not whether to do the entire restart > from user space (and I argue that most work should be done in the kernel), > but rather whether process creation (and only that) should be done in > kernel or user space. See my answer to Dave. > > Quick thoughts of pros/cons of each approach are: > > user space: > > + re-use existing api (fork) > + easier to debug > + will allow 'handmade' resources restart: it was mentioned before that > one may want to reattach stdout to a different place after restart; a > user based restart of processes can make this much easier: e.g. the > user process can create the alternative resources, give them to the > kernel and only then call sys_restart) > + arch-independent code > > - a bit slower than in kernel space > - requires a clone-with-specific-pid syscall or interface > > kernel space: > > + a bit easier to control everything > + a bit faster than user space > + no need for user-visible interface for clone-with-... > > - arch-dependent code > - needs special code to fight 'fork-bomb' > > So, I'm not convinced, and I even think there may be room to both, for > the time being. I volunteer to support the user-space alternative while > we make up our minds. Yes, I hope that investigating both approaches will give us stronger arguments. Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-30 18:14 ` Louis Rilling @ 2008-10-30 18:32 ` Oren Laadan 2008-10-31 10:37 ` Louis Rilling 0 siblings, 1 reply; 69+ messages in thread From: Oren Laadan @ 2008-10-30 18:32 UTC (permalink / raw) To: Louis.Rilling Cc: Andrey Mirkin, Dave Hansen, Serge E. Hallyn, Cedric Le Goater, Daniel Lezcano, containers, linux-kernel Louis Rilling wrote: > On Thu, Oct 30, 2008 at 01:45:25PM -0400, Oren Laadan wrote: >> >> Louis Rilling wrote: >>> In Kerrighed this is kernel-based, and will remain kernel-based because we >>> checkpoint a distributed task tree, and want to restart it as mush as possible >>> with the same distribution. The distributed protocol used for restart is >>> currently too fragile and complex to rely on customized user-space >>> implementations. That said, if someone brings very good arguments in favor of >>> userspace implementations, we might consider changing this. >> Zap also has distributed checkpoint which does not require strict >> kernel-side ordering. Do you need that because you do SSI ? > > Yes. Tasks from different nodes have parent-children, session leader, etc. > relationships, and the distributed management of struct pid lifecycle is a bit > touchy too. By the way, splitting the checkpoint image in one file for each task > helps us a lot to make restart parallel, because it is more efficient for the file > system to handle parallel reads of different files from different nodes than > parallel reads on a single file descriptor from different nodes. You can also make parallel restart work with the single stream, without much effort. Particularly if you store everything on the file system. In both cases, the limiting factor is shared resources - where one task cannot proceed with checkpoint because it waits for another task to first (re)create that resource. [...] Oren. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-30 18:32 ` Oren Laadan @ 2008-10-31 10:37 ` Louis Rilling 0 siblings, 0 replies; 69+ messages in thread From: Louis Rilling @ 2008-10-31 10:37 UTC (permalink / raw) To: Oren Laadan Cc: Andrey Mirkin, Dave Hansen, Serge E. Hallyn, Cedric Le Goater, Daniel Lezcano, containers, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2354 bytes --] On Thu, Oct 30, 2008 at 02:32:51PM -0400, Oren Laadan wrote: > > > Louis Rilling wrote: > > On Thu, Oct 30, 2008 at 01:45:25PM -0400, Oren Laadan wrote: > >> > >> Louis Rilling wrote: > >>> In Kerrighed this is kernel-based, and will remain kernel-based because we > >>> checkpoint a distributed task tree, and want to restart it as mush as possible > >>> with the same distribution. The distributed protocol used for restart is > >>> currently too fragile and complex to rely on customized user-space > >>> implementations. That said, if someone brings very good arguments in favor of > >>> userspace implementations, we might consider changing this. > >> Zap also has distributed checkpoint which does not require strict > >> kernel-side ordering. Do you need that because you do SSI ? > > > > Yes. Tasks from different nodes have parent-children, session leader, etc. > > relationships, and the distributed management of struct pid lifecycle is a bit > > touchy too. By the way, splitting the checkpoint image in one file for each task > > helps us a lot to make restart parallel, because it is more efficient for the file > > system to handle parallel reads of different files from different nodes than > > parallel reads on a single file descriptor from different nodes. > > You can also make parallel restart work with the single stream, without > much effort. Particularly if you store everything on the file system. Sure we can use a single stream, since we already share file descriptors accross nodes. But the distributed synchronization of the file pointer is costly compared to having each node access different files. This way we push the parallelization bottelneck down to the file system rather than in the distributed VFS layer. > > In both cases, the limiting factor is shared resources - where one task > cannot proceed with checkpoint because it waits for another task to first > (re)create that resource. We just try to avoid other bottlenecks :) And besides file descriptors, shared resources are as common as multi-threaded programs, which are not the majority of the workloads we can address. Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-30 6:02 ` Andrey Mirkin 2008-10-30 11:47 ` Louis Rilling @ 2008-10-30 14:08 ` Serge E. Hallyn 2008-10-30 17:03 ` Dave Hansen 2 siblings, 0 replies; 69+ messages in thread From: Serge E. Hallyn @ 2008-10-30 14:08 UTC (permalink / raw) To: Andrey Mirkin Cc: Oren Laadan, Dave Hansen, Cedric Le Goater, Daniel Lezcano, Louis Rilling, containers, linux-kernel Quoting Andrey Mirkin (major@openvz.org): > Anyway we should ask everyone what they think about user- and kernel- based > process creation. > Dave, Serge, Cedric, Daniel, Louis what do you think about that? I prefer kernel. Pretty sure Dave prefers user-space. I'd say the thing to do is push the core API that supports single-thread c/r. Then try to push the patch to do recreate processes from the kernel, and let the arguments for and against that stand on their own. -serge ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-30 6:02 ` Andrey Mirkin 2008-10-30 11:47 ` Louis Rilling 2008-10-30 14:08 ` Serge E. Hallyn @ 2008-10-30 17:03 ` Dave Hansen 2 siblings, 0 replies; 69+ messages in thread From: Dave Hansen @ 2008-10-30 17:03 UTC (permalink / raw) To: Andrey Mirkin Cc: Oren Laadan, Serge E. Hallyn, Cedric Le Goater, Daniel Lezcano, Louis Rilling, containers, linux-kernel On Thu, 2008-10-30 at 10:02 +0400, Andrey Mirkin wrote: > Anyway we should ask everyone what they think about user- and kernel- based > process creation. > Dave, Serge, Cedric, Daniel, Louis what do you think about that? My worry is where a single sys_restart() plus in-kernel process creation takes us. In practice, what do we do? Do we single-thread the entire restore process? Or, do we do in-kernel process creation and have multiple kernel threads trying to read out of different points in the checkpoint file, trying to restore all their own states in parallel? Does that mean that we can't in practice restore from a fd like a pipe or a network socket? In the same way, if we *do* create the processes in userspace, how do we do _that_? Do we just fork() and sleep() until the kernel comes along and blows our state away? How does the kernel process doing the restoring tell userspace how many things to fork? How do we match these new userspace processes up with the ones coming out of the checkpoint process? To me, it's just way too early to talk about this stuff. Both approaches have their issues, and I'm yet to see the differences manifested in code so I can really sink my teeth into them -- Dave ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-27 14:07 ` Andrey Mirkin 2008-10-27 14:39 ` Oren Laadan @ 2008-11-03 19:35 ` Oren Laadan 1 sibling, 0 replies; 69+ messages in thread From: Oren Laadan @ 2008-11-03 19:35 UTC (permalink / raw) To: Andrey Mirkin; +Cc: Dave Hansen, containers, linux-kernel Andrey Mirkin wrote: > On Monday 20 October 2008 19:55 Dave Hansen wrote: >> On Mon, 2008-10-20 at 16:14 +0400, Andrey Mirkin wrote: >>> Right now my patchset (v2) provides an ability to checkpoint and restart >>> a group of processes. The process of checkpointing and restart can be >>> initiated from external process (not from the process which should be >>> checkpointed). >> Absolutely. Oren's code does it this way to make for a smaller patch at >> first. The syscall takes a pid argument so it is surely expected to be >> expanded upon later. >> >>> Also I think that all the restart job (including process forking) should >>> be done in kernel, as in this case we will not depend on user space and >>> will be more secure. This is also implemented in my patchset. >> Do you think that this is an approach that Oren's patches are married >> to, or is this a "feature" we can add on later? > > Well, AFAICS from Oren's patch set his approach is oriented on process > creation in user space. I think we should choose right now what approach will > be used for process creation. > We have two options here: fork processes in kernel or fork them in user space. > If process will be forked in user space, then there will be a gap when process > will be in user space and can be killed with received signal before entering > kernel. Also we will need a functionolity to create processes with predefined > PID. I think it is not very good to provide such ability to user space. That Rethinking this -- if the user wishes she can construct a suitable checkpoint image that would do exactly that. It takes more effort than using a system call, but the result is similar. What I had in mind for that special clone-with-pid is to restrict when it can be used (e.g. when the container is in a "restarting" state or something like that. [...] Oren. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 12:14 ` [Devel] " Andrey Mirkin 2008-10-20 15:55 ` Dave Hansen @ 2008-10-20 17:17 ` Oren Laadan 2008-10-27 14:38 ` Andrey Mirkin 1 sibling, 1 reply; 69+ messages in thread From: Oren Laadan @ 2008-10-20 17:17 UTC (permalink / raw) To: Andrey Mirkin Cc: devel, containers, Pavel Emelyanov, linux-kernel, Dave Hansen Andrey Mirkin wrote: > On Saturday 18 October 2008 03:33 Dave Hansen wrote: >> On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: >>> This patchset introduces kernel based checkpointing/restart as it is >>> implemented in OpenVZ project. This patchset has limited functionality >>> and are able to checkpoint/restart only single process. Recently Oren >>> Laaden sent another kernel based implementation of checkpoint/restart. >>> The main differences between this patchset and Oren's patchset are: >> Hi Andrey, >> >> I'm curious what you want to happen with this patch set. Is there >> something specific in Oren's set that deficient which you need >> implemented? Are there some technical reasons you prefer this code? > > Hi Dave, > > Right now my patchset (v2) provides an ability to checkpoint and restart a > group of processes. The process of checkpointing and restart can be initiated > from external process (not from the process which should be checkpointed). Both patchsets share the same design, namely be able to checkpoint and restart multiple processes, with the operation initiated by an external processes. I deliberately left out the part that handles multiple processes to keeps things simple for initial review, and until we decide on the question of kernel- or user- based process creation on restart. > Also I think that all the restart job (including process forking) should be > done in kernel, as in this case we will not depend on user space and will be > more secure. This is also implemented in my patchset. I'm not convinced that creating the processes in the kernel makes it more secure. Can you elaborate ? for the discussion, let's compare these two basic scenarios: 1) container and processes are created in user space; each process calls "sys_restart()" which eventually calls "do_restart()" that does kernel-based restart. 2) container and processes are created in kernel space; each process calls "do_restart()" to do kernel-based restart. In fact, creating in user based makes it easier to enforce capabilities and limits of the user. It also simplifies the debugging significantly, and allows us to delegate the entire issue of containers and namespace management back to user space, where it probably belongs. On the other hand, doing it in kernel space likely to produce simpler code for the creation of the processes. Thanks, Oren. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [Devel] Re: [PATCH 0/9] OpenVZ kernel based checkpointing/restart 2008-10-20 17:17 ` Oren Laadan @ 2008-10-27 14:38 ` Andrey Mirkin 0 siblings, 0 replies; 69+ messages in thread From: Andrey Mirkin @ 2008-10-27 14:38 UTC (permalink / raw) To: Oren Laadan; +Cc: devel, containers, linux-kernel, Dave Hansen On Monday 20 October 2008 21:17 Oren Laadan wrote: > Andrey Mirkin wrote: > > On Saturday 18 October 2008 03:33 Dave Hansen wrote: > >> On Wed, 2008-09-03 at 14:57 +0400, Andrey Mirkin wrote: > >>> This patchset introduces kernel based checkpointing/restart as it is > >>> implemented in OpenVZ project. This patchset has limited functionality > >>> and are able to checkpoint/restart only single process. Recently Oren > >>> Laaden sent another kernel based implementation of checkpoint/restart. > >>> The main differences between this patchset and Oren's patchset are: > >> > >> Hi Andrey, > >> > >> I'm curious what you want to happen with this patch set. Is there > >> something specific in Oren's set that deficient which you need > >> implemented? Are there some technical reasons you prefer this code? > > > > Hi Dave, > > > > Right now my patchset (v2) provides an ability to checkpoint and restart > > a group of processes. The process of checkpointing and restart can be > > initiated from external process (not from the process which should be > > checkpointed). > > Both patchsets share the same design, namely be able to checkpoint and > restart multiple processes, with the operation initiated by an external > processes. > > I deliberately left out the part that handles multiple processes to > keeps things simple for initial review, and until we decide on the > question of kernel- or user- based process creation on restart. I agree that multiple process handling is not needed for initial review. But I believe that the question with process creation should be discussed right now. > > Also I think that all the restart job (including process forking) should > > be done in kernel, as in this case we will not depend on user space and > > will be more secure. This is also implemented in my patchset. > > I'm not convinced that creating the processes in the kernel makes it > more secure. Can you elaborate ? for the discussion, let's compare > these two basic scenarios: > > 1) container and processes are created in user space; each process > calls "sys_restart()" which eventually calls "do_restart()" that > does kernel-based restart. Well, in this case there will be a gap after process is returned from fork but before entering kernel. During that time process can be killed by delivered signal. Another drawback of this approach is that we will need to provide an ability for user to create processes with predefined PID. > 2) container and processes are created in kernel space; each process > calls "do_restart()" to do kernel-based restart. > > In fact, creating in user based makes it easier to enforce capabilities > and limits of the user. It also simplifies the debugging significantly, > and allows us to delegate the entire issue of containers and namespace > management back to user space, where it probably belongs. > > On the other hand, doing it in kernel space likely to produce simpler > code for the creation of the processes. You right here. Both approaches have pros and cons, but I think that kernel approach has more advantages Andrey ^ permalink raw reply [flat|nested] 69+ messages in thread
end of thread, other threads:[~2008-11-03 19:36 UTC | newest] Thread overview: 69+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-09-03 10:57 [PATCH 0/9] OpenVZ kernel based checkpointing/restart Andrey Mirkin 2008-09-03 10:57 ` [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls Andrey Mirkin 2008-09-03 10:57 ` [PATCH 2/9] Make checkpoint/restart functionality modular Andrey Mirkin 2008-09-03 10:57 ` [PATCH 3/9] Introduce context structure needed during checkpointing/restart Andrey Mirkin 2008-09-03 10:57 ` [PATCH 4/9] Introduce container dump function Andrey Mirkin 2008-09-03 10:57 ` [PATCH 5/9] Introduce function to dump process Andrey Mirkin 2008-09-03 10:57 ` [PATCH 6/9] Introduce functions to dump mm Andrey Mirkin 2008-09-03 10:57 ` [PATCH 7/9] Introduce function for restarting a container Andrey Mirkin 2008-09-03 10:57 ` [PATCH 8/9] Introduce functions to restart a process Andrey Mirkin 2008-09-03 10:57 ` [PATCH 9/9] Introduce functions to restore mm Andrey Mirkin 2008-09-03 14:32 ` [PATCH 8/9] Introduce functions to restart a process Louis Rilling 2008-09-13 17:34 ` Pavel Machek 2008-09-03 14:17 ` [PATCH 6/9] Introduce functions to dump mm Louis Rilling 2008-09-03 14:23 ` [PATCH 4/9] Introduce container dump function Serge E. Hallyn 2008-09-03 14:45 ` Andrey Mirkin 2008-09-03 12:29 ` [PATCH 3/9] Introduce context structure needed during checkpointing/restart Matthieu Fertré 2008-09-03 14:11 ` Andrey Mirkin 2008-09-03 13:56 ` Louis Rilling 2008-09-03 14:07 ` Andrey Mirkin 2008-09-03 14:13 ` Cedric Le Goater 2008-09-03 14:29 ` Andrey Mirkin 2008-09-03 14:27 ` [PATCH 2/9] Make checkpoint/restart functionality modular Serge E. Hallyn 2008-09-03 14:51 ` Andrey Mirkin 2008-09-03 11:44 ` [PATCH 1/9] Introduce trivial sys_checkpoint and sys_restore system calls Cedric Le Goater 2008-09-03 13:05 ` [Devel] " Andrey Mirkin 2008-09-03 12:28 ` [PATCH 0/9] OpenVZ kernel based checkpointing/restart Cedric Le Goater 2008-09-03 13:59 ` [Devel] " Andrey Mirkin 2008-09-04 22:55 ` Dave Hansen 2008-09-03 14:18 ` Serge E. Hallyn 2008-09-03 13:49 ` Louis Rilling 2008-09-03 14:06 ` Louis Rilling 2008-09-03 14:19 ` Andrey Mirkin 2008-09-03 14:26 ` Cedric Le Goater 2008-09-03 14:53 ` Andrey Mirkin 2008-09-04 8:14 ` Oren Laadan 2008-09-04 14:05 ` Dave Hansen 2008-10-17 23:33 ` Dave Hansen 2008-10-20 11:10 ` Louis Rilling 2008-10-20 13:25 ` Daniel Lezcano 2008-10-20 13:48 ` Cedric Le Goater 2008-10-20 13:49 ` Daniel Lezcano 2008-10-20 15:53 ` Oren Laadan 2008-10-20 16:37 ` Daniel Lezcano 2008-10-20 17:23 ` Serge E. Hallyn 2008-10-21 0:18 ` Oren Laadan 2008-10-21 0:58 ` Serge E. Hallyn 2008-10-21 13:24 ` Daniel Lezcano 2008-10-27 14:45 ` [Devel] " Andrey Mirkin 2008-10-20 16:51 ` Serge E. Hallyn 2008-10-21 9:36 ` Cedric Le Goater 2008-10-20 16:36 ` Dave Hansen 2008-10-20 12:14 ` [Devel] " Andrey Mirkin 2008-10-20 15:55 ` Dave Hansen 2008-10-27 14:07 ` Andrey Mirkin 2008-10-27 14:39 ` Oren Laadan 2008-10-30 6:02 ` Andrey Mirkin 2008-10-30 11:47 ` Louis Rilling 2008-10-30 17:08 ` Dave Hansen 2008-10-30 18:01 ` Louis Rilling 2008-10-30 18:28 ` Oren Laadan 2008-10-30 17:45 ` Oren Laadan 2008-10-30 18:14 ` Louis Rilling 2008-10-30 18:32 ` Oren Laadan 2008-10-31 10:37 ` Louis Rilling 2008-10-30 14:08 ` Serge E. Hallyn 2008-10-30 17:03 ` Dave Hansen 2008-11-03 19:35 ` Oren Laadan 2008-10-20 17:17 ` Oren Laadan 2008-10-27 14:38 ` Andrey Mirkin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).