* [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] <20080807224033.FFB3A2C1@kernel> @ 2008-08-07 22:40 ` Dave Hansen 2008-08-07 22:40 ` [RFC][PATCH 2/4] checkpoint/restart: x86 support Dave Hansen ` (7 subsequent siblings) 8 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-07 22:40 UTC (permalink / raw) To: Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> This patch adds those interfaces, as well as all of the helpers needed to easily manage the file format. The code is roughly broken out as follows: ckpt/sys.c - user/kernel data transfer, as well as setting up of the checkpoint/restart context (a per-checkpoint data structure for housekeeping) ckpt/checkpoint.c - output wrappers and basic checkpoint handling ckpt/restart.c - input wrappers and basic restart handling Patches to add the per-architecture support as well as the actual work to do the memory checkpoint follow in subsequent patches. Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- linux-2.6.git-dave/Makefile | 2 linux-2.6.git-dave/ckpt/Makefile | 1 linux-2.6.git-dave/ckpt/checkpoint.c | 207 +++++++++++++++++++++++++++++++ linux-2.6.git-dave/ckpt/ckpt.h | 82 ++++++++++++ linux-2.6.git-dave/ckpt/ckpt_hdr.h | 69 ++++++++++ linux-2.6.git-dave/ckpt/restart.c | 189 ++++++++++++++++++++++++++++ linux-2.6.git-dave/ckpt/sys.c | 233 +++++++++++++++++++++++++++++++++++ 7 files changed, 782 insertions(+), 1 deletion(-) diff -puN /dev/null ckpt/checkpoint.c --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/ckpt/checkpoint.c 2008-08-07 15:37:22.000000000 -0700 @@ -0,0 +1,207 @@ +/* + * Checkpoint logic and helpers + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#include <linux/version.h> +#include <linux/sched.h> +#include <linux/time.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/dcache.h> +#include <linux/mount.h> +#include <asm/ptrace.h> + +#include "ckpt.h" +#include "ckpt_hdr.h" + +/** + * cr_get_fname - return pathname of a given file + * @file: file pointer + * @buf: buffer for pathname + * @n: buffer length (in) and pathname length (out) + * + * if the buffer provivded by the caller is too small, allocate a new + * buffer; caller should call cr_put_pathname() for cleanup + */ +char *cr_get_fname(struct path *path, struct path *root, char *buf, int *n) +{ + char *fname; + + fname = __d_path(path, root, buf, *n); + + if (IS_ERR(fname) && PTR_ERR(fname) == -ENAMETOOLONG) { + if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0))) + return ERR_PTR(-ENOMEM); + fname = __d_path(path, root, buf, PAGE_SIZE); + if (IS_ERR(fname)) + free_pages((unsigned long) buf, 0); + } + if (!IS_ERR(fname)) + *n = (buf + *n - fname); + + return fname; +} + +/** + * cr_put_fname - (possibly) cleanup pathname buffer + * @buf: original buffer that was given to cr_get_pathname() + * @fname: resulting pathname from cr_get_pathname() + * @n: length of original buffer + */ +void cr_put_fname(char *buf, char *fname, int n) +{ + if (fname && (fname < buf || fname >= buf + n)) + free_pages((unsigned long) buf, 0); +} + +/** + * cr_write_obj - write a record described by a cr_hdr + * @ctx: checkpoint context + * @h: record descriptor + * @buf: record buffer + */ +int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf) +{ + int ret; + + if ((ret = cr_kwrite(ctx, h, sizeof(*h))) < 0) + return ret; + return cr_kwrite(ctx, buf, h->len); +} + +/** + * cr_write_str - write a string record + * @ctx: checkpoint context + * @str: string buffer + * @n: string length + */ +int cr_write_str(struct cr_ctx *ctx, char *str, int n) +{ + struct cr_hdr h; + + h.type = CR_HDR_STR; + h.len = n; + h.id = 0; + + return cr_write_obj(ctx, &h, str); +} + +/* write the checkpoint header */ +static int cr_write_hdr(struct cr_ctx *ctx) +{ + struct cr_hdr h; + struct cr_hdr_head *hh = ctx->tbuf; + struct timeval ktv; + + h.type = CR_HDR_HEAD; + h.len = sizeof(*hh); + h.id = 0; + + do_gettimeofday(&ktv); + + hh->magic = 0x00a2d200; + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff; + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff; + hh->patch = (LINUX_VERSION_CODE) & 0xff; + + hh->version = 1; + + hh->flags = ctx->flags; + hh->time = ktv.tv_sec; + + return cr_write_obj(ctx, &h, hh); +} + +/* write the checkpoint trailer */ +static int cr_write_tail(struct cr_ctx *ctx) +{ + struct cr_hdr h; + struct cr_hdr_tail *hh = ctx->tbuf; + + h.type = CR_HDR_TAIL; + h.len = sizeof(*hh); + h.id = 0; + + hh->magic = 0x002d2a00; + hh->cksum[0] = hh->cksum[1] = 1; /* TBD ... */ + + return cr_write_obj(ctx, &h, hh); +} + +/* dump the task_struct of a given task */ +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t) +{ + struct cr_hdr h; + struct cr_hdr_task *hh = ctx->tbuf; + + h.type = CR_HDR_TASK; + h.len = sizeof(*hh); + h.id = ctx->pid; + + hh->state = t->state; + hh->exit_state = t->exit_state; + hh->exit_code = t->exit_code; + hh->exit_signal = t->exit_signal; + + hh->pid = t->pid; + hh->tgid = t->tgid; + + hh->utime = t->utime; + hh->stime = t->stime; + hh->utimescaled = t->utimescaled; + hh->stimescaled = t->stimescaled; + hh->gtime = t->gtime; + hh->prev_utime = t->prev_utime; + hh->prev_stime = t->prev_stime; + hh->nvcsw = t->nvcsw; + hh->nivcsw = t->nivcsw; + hh->start_time_sec = t->start_time.tv_sec; + hh->start_time_nsec = t->start_time.tv_nsec; + hh->real_start_time_sec = t->real_start_time.tv_sec; + hh->real_start_time_nsec = t->real_start_time.tv_nsec; + hh->min_flt = t->min_flt; + hh->maj_flt = t->maj_flt; + + hh->task_comm_len = TASK_COMM_LEN; + memcpy(hh->comm, t->comm, TASK_COMM_LEN); + + return cr_write_obj(ctx, &h, hh); +} + +/* dump the entire state of a given task */ +static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t) +{ + int ret ; + + BUG_ON(t->state == TASK_DEAD); + + ret = cr_write_task_struct(ctx, t); + CR_PRINTK("ret (task_struct) %d\n", ret); + + return ret; +} + +int do_checkpoint(struct cr_ctx *ctx) +{ + int ret; + + /* FIX: need to test whether container is checkpointable */ + + ret = cr_write_hdr(ctx); + if (!ret) + ret = cr_write_task(ctx, current); + if (!ret) + ret = cr_write_tail(ctx); + + /* on success, return (unique) checkpoint identifier */ + if (!ret) + ret = ctx->crid; + + return ret; +} diff -puN /dev/null ckpt/ckpt.h --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/ckpt/ckpt.h 2008-08-05 09:04:27.000000000 -0700 @@ -0,0 +1,82 @@ +#ifndef _CKPT_CKPT_H_ +#define _CKPT_CKPT_H_ +/* + * Generic container checkpoint-restart + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#include <linux/path.h> +#include <linux/fs.h> + +struct cr_pgarr; + +struct cr_ctx { + pid_t pid; /* container identifier */ + int crid; /* unique checkpoint id */ + + unsigned long flags; + unsigned long oflags; /* restart: old flags */ + + struct file *file; + int total; /* total read/written */ + + void *tbuf; /* temp: to avoid many alloc/dealloc */ + void *hbuf; /* header: to avoid many alloc/dealloc */ + int hpos; + + struct cr_pgarr *pgarr; + struct cr_pgarr *pgcur; + + struct path *vfsroot; /* container root */ +}; + +/* cr_ctx: flags */ +#define CR_CTX_CKPT 0x1 +#define CR_CTX_RSTR 0x2 + +/* allocation defaults */ +#define CR_ORDER_TBUF 1 +#define CR_ORDER_HBUF 1 + +#define CR_TBUF_TOTAL ((PAGE_SIZE << CR_ORDER_TBUF) / sizeof(void *)) +#define CR_HBUF_TOTAL ((PAGE_SIZE << CR_ORDER_HBUF) / sizeof(void *)) + +extern void cr_put_fname(char *buf, char *fname, int n); +extern char *cr_get_fname(struct path *path, struct path *root, char *buf, int *n); + +extern int cr_uwrite(struct cr_ctx *ctx, void *buf, int count); +extern int cr_kwrite(struct cr_ctx *ctx, void *buf, int count); +extern int cr_uread(struct cr_ctx *ctx, void *buf, int count); +extern int cr_kread(struct cr_ctx *ctx, void *buf, int count); + +extern void *cr_hbuf_get(struct cr_ctx *ctx, int n); +extern void cr_hbuf_put(struct cr_ctx *ctx, int n); + +struct cr_hdr; + +extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf); +extern int cr_write_str(struct cr_ctx *ctx, char *str, int n); +extern int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t); + +extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n); +extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type); +extern int cr_read_str(struct cr_ctx *ctx, void *str, int n); +extern int cr_read_mm(struct cr_ctx *ctx); + +extern int do_checkpoint(struct cr_ctx *ctx); +extern int do_restart(struct cr_ctx *ctx); + +/* debugging */ +#if 0 +#define CR_PRINTK(str, args...) \ + printk(KERN_ERR "cr@%s#%d: " str, __func__, __LINE__, ##args) +#else +#define CR_PRINTK(...) do {} while (0) +#endif + +#endif /* _CKPT_CKPT_H_ */ diff -puN /dev/null ckpt/ckpt_hdr.h --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/ckpt/ckpt_hdr.h 2008-08-07 15:37:22.000000000 -0700 @@ -0,0 +1,69 @@ +/* + * Generic container checkpoint-restart + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#include <linux/types.h> + +struct cr_hdr { + __s16 type; + __s16 len; + __u32 id; +}; + +enum { + CR_HDR_HEAD = 1, + CR_HDR_STR, + + CR_HDR_TASK = 101, + CR_HDR_THREAD, + CR_HDR_CPU, + + CR_HDR_MM = 201, + CR_HDR_VMA, + CR_HDR_MM_CONTEXT, + + CR_HDR_TAIL = 5001 +}; + +struct cr_hdr_head { + __u32 magic; + __u16 major; + __u16 minor; + __u16 patch; + __u16 version; + __u32 flags; /* checkpoint options */ + __u64 time; /* when checkpoint taken */ +}; + +struct cr_hdr_tail { + __u32 magic; + __u32 cksum[2]; +}; + +struct cr_hdr_task { + __u64 state; + __u32 exit_state; + __u32 exit_code, exit_signal; + + __u16 pid; + __u16 tgid; + + __u64 utime, stime, utimescaled, stimescaled; + __u64 gtime; + __u64 prev_utime, prev_stime; + __u64 nvcsw, nivcsw; + __u64 start_time_sec, start_time_nsec; + __u64 real_start_time_sec, real_start_time_nsec; + __u64 min_flt, maj_flt; + + __s16 task_comm_len; + char comm[TASK_COMM_LEN]; +}; + + diff -puN /dev/null ckpt/Makefile --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/ckpt/Makefile 2008-08-07 15:37:22.000000000 -0700 @@ -0,0 +1 @@ +obj-y += sys.o checkpoint.o restart.o diff -puN /dev/null ckpt/restart.c --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/ckpt/restart.c 2008-08-07 15:37:22.000000000 -0700 @@ -0,0 +1,189 @@ +/* + * Restart logic and helpers + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +/* + * During restart the code reads in data from the chekcpoint image into a + * temporary buffer (ctx->hbuf). Because operations can be nested, one + * should call cr_hbuf_get() to reserve space in the buffer, and then + * cr_hbuf_put() when it no longer needs that space + */ + +#include <linux/version.h> +#include <linux/sched.h> +#include <linux/file.h> + +#include "ckpt.h" +#include "ckpt_hdr.h" + +/** + * cr_hbuf_get - reserve space on the hbuf + * @ctx: checkpoint context + * @n: number of bytes to reserve + */ +void *cr_hbuf_get(struct cr_ctx *ctx, int n) +{ + void *ptr; + + BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL); + ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos); + ctx->hpos += n; + return ptr; +} + +/** + * cr_hbuf_put - unreserve space on the hbuf + * @ctx: checkpoint context + * @n: number of bytes to reserve + */ +void cr_hbuf_put(struct cr_ctx *ctx, int n) +{ + BUG_ON(ctx->hpos < n); + ctx->hpos -= n; +} + +/** + * cr_read_obj - read a whole record (cr_hdr followed by payload) + * @ctx: checkpoint context + * @h: record descriptor + * @buf: record buffer + * @n: available buffer size + */ +int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n) +{ + int ret; + + ret = cr_kread(ctx, h, sizeof(*h)); + if (ret < 0) + return ret; + + CR_PRINTK("type %d len %d id %d (%d)\n", h->type, h->len, h->id, n); + if (h->len < 0 || h->len > n) + return -EINVAL; + + return cr_kread(ctx, buf, h->len); +} + +/** + * cr_read_obj_type - read a whole record of expected type + * @ctx: checkpoint context + * @buf: record buffer + * @n: available buffer size + * @type: expected record type + */ +int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type) +{ + struct cr_hdr h; + int ret; + + ret = cr_read_obj(ctx, &h, buf, n); + if (!ret) + ret = (h.type == type ? h.id : -EINVAL); + return ret; +} + +/** + * cr_read_str - read a string record + * @ctx: checkpoint context + * @str: string buffer + * @n: string length + */ +int cr_read_str(struct cr_ctx *ctx, void *str, int n) +{ + return cr_read_obj_type(ctx, str, n, CR_HDR_STR); +} + +/* read the checkpoint header */ +static int cr_read_hdr(struct cr_ctx *ctx) +{ + struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh)); + int ret; + + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD); + if (ret < 0) + return ret; + + if (hh->magic != 0x00a2d200 || hh->version != 1 || + hh->major != ((LINUX_VERSION_CODE >> 16) & 0xff) || + hh->minor != ((LINUX_VERSION_CODE >> 8) & 0xff) || + hh->patch != ((LINUX_VERSION_CODE) & 0xff)) + return -EINVAL; + + if (hh->flags & ~CR_CTX_CKPT) + return -EINVAL; + + ctx->oflags = hh->flags; + + cr_hbuf_put(ctx, sizeof(*hh)); + return 0; +} + +/* read the checkpoint trailer */ +static int cr_read_tail(struct cr_ctx *ctx) +{ + struct cr_hdr_tail *hh = cr_hbuf_get(ctx, sizeof(*hh)); + int ret; + + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TAIL); + if (ret < 0) + return ret; + + if (hh->magic != 0x002d2a00 || + hh->cksum[0] != 1 || hh->cksum[1] != 1) + return -EINVAL; + + cr_hbuf_put(ctx, sizeof(*hh)); + return 0; +} + +/* read the task_struct into the current task */ +static int cr_read_task_struct(struct cr_ctx *ctx) +{ + struct cr_hdr_task *hh = cr_hbuf_get(ctx, sizeof(*hh)); + struct task_struct *t = current; + int ret; + + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_TASK); + if (ret < 0) + return ret; + + /* for now, only restore t->comm */ + if (hh->task_comm_len < 0 || hh->task_comm_len > TASK_COMM_LEN) + return -EINVAL; + + memset(t->comm, 0, TASK_COMM_LEN); + memcpy(t->comm, hh->comm, hh->task_comm_len); + + cr_hbuf_put(ctx, sizeof(*hh)); + return 0; +} + +/* read the entire state of the current task */ +static int cr_read_task(struct cr_ctx *ctx) +{ + int ret; + + ret = cr_read_task_struct(ctx); + CR_PRINTK("ret (task_struct) %d\n", ret); + + return ret; +} + +int do_restart(struct cr_ctx *ctx) +{ + int ret; + + ret = cr_read_hdr(ctx); + if (!ret) + ret = cr_read_task(ctx); + if (!ret) + ret = cr_read_tail(ctx); + + return ret; +} diff -puN /dev/null ckpt/sys.c --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/ckpt/sys.c 2008-08-07 15:37:22.000000000 -0700 @@ -0,0 +1,233 @@ +/* + * Generic container checkpoint-restart + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#include <linux/sched.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/uaccess.h> +#include <linux/capability.h> + +#include "ckpt.h" + +/* + * helpers to write/read to/from the image file descriptor + * + * cr_uwrite() - write a user-space buffer to the checkpoint image + * cr_kwrite() - write a kernel-space buffer to the checkpoint image + * cr_uread() - read from the checkpoint image to a user-space buffer + * cr_kread() - read from the checkpoint image to a kernel-space buffer + * + */ + +/* (temporarily added file_pos_read() and file_pos_write() because they + * are static in fs/read_write.c... should cleanup and remove later) */ +static inline loff_t file_pos_read(struct file *file) +{ + return file->f_pos; +} + +static inline void file_pos_write(struct file *file, loff_t pos) +{ + file->f_pos = pos; +} + +int cr_uwrite(struct cr_ctx *ctx, void *buf, int count) +{ + struct file *file = ctx->file; + ssize_t nwrite; + int nleft; + + for (nleft = count; nleft; nleft -= nwrite) { + loff_t pos = file_pos_read(file); + nwrite = vfs_write(file, (char __user *) buf, nleft, &pos); + file_pos_write(file, pos); + if (unlikely(nwrite <= 0)) /* zero tolerance */ + return (nwrite ? : -EIO); + buf += nwrite; + } + + ctx->total += count; + return 0; +} + +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count) +{ + mm_segment_t oldfs; + int ret; + + oldfs = get_fs(); + set_fs(KERNEL_DS); + ret = cr_uwrite(ctx, buf, count); + set_fs(oldfs); + + return ret; +} + +int cr_uread(struct cr_ctx *ctx, void *buf, int count) +{ + struct file *file = ctx->file; + ssize_t nread; + int nleft; + + for (nleft = count; nleft; nleft -= nread) { + loff_t pos = file_pos_read(file); + nread = vfs_read(file, (char __user *) buf, nleft, &pos); + file_pos_write(file, pos); + if (unlikely(nread <= 0)) /* zero tolerance */ + return (nread ? : -EIO); + buf += nread; + } + + ctx->total += count; + return 0; +} + +int cr_kread(struct cr_ctx *ctx, void *buf, int count) +{ + mm_segment_t oldfs; + int ret; + + oldfs = get_fs(); + set_fs(KERNEL_DS); + ret = cr_uread(ctx, buf, count); + set_fs(oldfs); + + return ret; +} + + +/* + * helpers to manage CR contexts: allocated for each checkpoint and/or + * restart operation, and persists until the operation is completed. + */ + +static atomic_t cr_ctx_count; /* unique checkpoint identifier */ + +void cr_ctx_free(struct cr_ctx *ctx) +{ + + if (ctx->file) + fput(ctx->file); + if (ctx->vfsroot) + path_put(ctx->vfsroot); + + free_pages((unsigned long) ctx->tbuf, CR_ORDER_TBUF); + free_pages((unsigned long) ctx->hbuf, CR_ORDER_HBUF); + + kfree(ctx); +} + +struct cr_ctx *cr_ctx_alloc(pid_t pid, struct file *file, unsigned long flags) +{ + struct cr_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + ctx->tbuf = (void *) __get_free_pages(GFP_KERNEL, CR_ORDER_TBUF); + ctx->hbuf = (void *) __get_free_pages(GFP_KERNEL, CR_ORDER_HBUF); + if (!ctx->tbuf || !ctx->hbuf) + goto nomem; + + ctx->pid = pid; + ctx->flags = flags; + + ctx->file = file; + get_file(file); + + /* assume checkpointer is in container's root vfs */ + ctx->vfsroot = ¤t->fs->root; + path_get(ctx->vfsroot); + + ctx->crid = atomic_inc_return(&cr_ctx_count); + + return ctx; + + nomem: + cr_ctx_free(ctx); + return NULL; +} + +/** + * sys_checkpoint - checkpoint a container + * @pid: pid of the container init(1) process + * @fd: file to which dump the checkpoint image + * @flags: checkpoint operation flags + */ +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags) +{ + struct cr_ctx *ctx; + struct file *file; + int fput_needed; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + file = fget_light(fd, &fput_needed); + if (!file) + return -EBADF; + + /* no flags for now */ + if (flags) + return -EINVAL; + + ctx = cr_ctx_alloc(pid, file, flags | CR_CTX_CKPT); + if (!ctx) { + fput_light(file, fput_needed); + return -ENOMEM; + } + + ret = do_checkpoint(ctx); + + cr_ctx_free(ctx); + fput_light(file, fput_needed); + CR_PRINTK("ckpt retval = %d\n", ret); + return ret; +} + +/** + * sys_restart - restart a container + * @crid: checkpoint image identifier + * @fd: file from which read the checkpoint image + * @flags: restart operation flags + */ +asmlinkage long sys_restart(int crid, int fd, unsigned long flags) +{ + struct cr_ctx *ctx; + struct file *file; + int fput_needed; + int ret; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + file = fget_light(fd, &fput_needed); + if (!file) + return -EBADF; + + /* no flags for now */ + if (flags) + return -EINVAL; + + ctx = cr_ctx_alloc(crid, file, flags | CR_CTX_RSTR); + if (!ctx) { + fput_light(file, fput_needed); + return -ENOMEM; + } + + ret = do_restart(ctx); + + cr_ctx_free(ctx); + fput_light(file, fput_needed); + CR_PRINTK("restart retval = %d\n", ret); + return ret; +} diff -puN Makefile~handle_a_single_task_with_private_memory_maps Makefile --- linux-2.6.git/Makefile~handle_a_single_task_with_private_memory_maps 2008-08-05 09:04:27.000000000 -0700 +++ linux-2.6.git-dave/Makefile 2008-08-05 09:07:53.000000000 -0700 @@ -611,7 +611,7 @@ export mod_strip_cmd ifeq ($(KBUILD_EXTMOD),) -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ckpt/ vmlinux-dirs := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \ $(core-y) $(core-m) $(drivers-y) $(drivers-m) \ _ ^ permalink raw reply [flat|nested] 71+ messages in thread
* [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] <20080807224033.FFB3A2C1@kernel> 2008-08-07 22:40 ` [RFC][PATCH 1/4] checkpoint-restart: general infrastructure Dave Hansen @ 2008-08-07 22:40 ` Dave Hansen 2008-08-07 22:40 ` [RFC][PATCH 3/4] checkpoint/restart: memory management Dave Hansen ` (6 subsequent siblings) 8 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-07 22:40 UTC (permalink / raw) To: Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen The original version of Oren's patch contained a good hunk of #ifdefs. I've extracted all of those and created a bit of an API for new architectures to follow. Leaving Oren's sign-off because this is all still his code, even though he hasn't seen it mangled like this before. Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- linux-2.6.git-dave/ckpt/Makefile | 1 linux-2.6.git-dave/ckpt/checkpoint.c | 7 linux-2.6.git-dave/ckpt/ckpt_arch.h | 6 linux-2.6.git-dave/ckpt/restart.c | 7 linux-2.6.git-dave/ckpt/x86.c | 269 ++++++++++++++++++++++++++++++ linux-2.6.git-dave/include/asm-x86/ckpt.h | 46 +++++ 6 files changed, 336 insertions(+) diff -puN ckpt/checkpoint.c~x86_part ckpt/checkpoint.c --- linux-2.6.git/ckpt/checkpoint.c~x86_part 2008-08-04 13:29:59.000000000 -0700 +++ linux-2.6.git-dave/ckpt/checkpoint.c 2008-08-04 13:29:59.000000000 -0700 @@ -19,6 +19,7 @@ #include "ckpt.h" #include "ckpt_hdr.h" +#include "ckpt_arch.h" /** * cr_get_fname - return pathname of a given file @@ -183,6 +184,12 @@ static int cr_write_task(struct cr_ctx * ret = cr_write_task_struct(ctx, t); CR_PRINTK("ret (task_struct) %d\n", ret); + if (!ret) + ret = cr_write_thread(ctx, t); + CR_PRINTK("ret (thread) %d\n", ret); + if (!ret) + ret = cr_write_cpu(ctx, t); + CR_PRINTK("ret (cpu) %d\n", ret); return ret; } diff -puN /dev/null ckpt/ckpt_arch.h --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/ckpt/ckpt_arch.h 2008-08-04 13:29:59.000000000 -0700 @@ -0,0 +1,6 @@ +#include "ckpt.h" + +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t); +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t); +int cr_read_thread(struct cr_ctx *ctx); +int cr_read_cpu(struct cr_ctx *ctx); diff -puN ckpt/Makefile~x86_part ckpt/Makefile --- linux-2.6.git/ckpt/Makefile~x86_part 2008-08-04 13:29:59.000000000 -0700 +++ linux-2.6.git-dave/ckpt/Makefile 2008-08-04 13:29:59.000000000 -0700 @@ -1 +1,2 @@ obj-y += sys.o checkpoint.o restart.o +obj-$(CONFIG_X86) += x86.o diff -puN ckpt/restart.c~x86_part ckpt/restart.c --- linux-2.6.git/ckpt/restart.c~x86_part 2008-08-04 13:29:59.000000000 -0700 +++ linux-2.6.git-dave/ckpt/restart.c 2008-08-04 13:29:59.000000000 -0700 @@ -21,6 +21,7 @@ #include "ckpt.h" #include "ckpt_hdr.h" +#include "ckpt_arch.h" /** * cr_hbuf_get - reserve space on the hbuf @@ -171,6 +172,12 @@ static int cr_read_task(struct cr_ctx *c ret = cr_read_task_struct(ctx); CR_PRINTK("ret (task_struct) %d\n", ret); + if (!ret) + ret = cr_read_thread(ctx); + CR_PRINTK("ret (thread) %d\n", ret); + if (!ret) + ret = cr_read_cpu(ctx); + CR_PRINTK("ret (cpu) %d\n", ret); return ret; } diff -puN /dev/null ckpt/x86.c --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/ckpt/x86.c 2008-08-04 13:29:59.000000000 -0700 @@ -0,0 +1,269 @@ +#include <asm/ckpt.h> +#include <asm/desc.h> +#include <asm/i387.h> + +#include "ckpt.h" +#include "ckpt_hdr.h" + +/* dump the thread_struct of a given task */ +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t) +{ + struct cr_hdr h; + struct cr_hdr_thread *hh = ctx->tbuf; + struct thread_struct *thread; + struct desc_struct *desc; + int ntls = 0; + int n, ret; + + h.type = CR_HDR_THREAD; + h.len = sizeof(*hh); + h.id = ctx->pid; + + thread = &t->thread; + + /* calculate no. of TLS entries that follow */ + desc = thread->tls_array; + for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) { + if (desc->a || desc->b) + ntls++; + } + + hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES; + hh->sizeof_tls_array = sizeof(thread->tls_array); + hh->ntls = ntls; + + if ((ret = cr_write_obj(ctx, &h, hh)) < 0) + return ret; + + /* for simplicity dump the entire array, cherry-pick upon restart */ + ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array)); + + CR_PRINTK("ntls %d\n", ntls); + + /* IGNORE RESTART BLOCKS FOR NOW ... */ + + return ret; +} + +/* dump the cpu state and registers of a given task */ +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t) +{ + struct cr_hdr h; + struct cr_hdr_cpu *hh = ctx->tbuf; + struct thread_struct *thread; + struct thread_info *thread_info; + struct pt_regs *regs; + + h.type = CR_HDR_CPU; + h.len = sizeof(*hh); + h.id = ctx->pid; + + thread = &t->thread; + thread_info = task_thread_info(t); + regs = task_pt_regs(t); + + hh->bx = regs->bx; + hh->cx = regs->cx; + hh->dx = regs->dx; + hh->si = regs->si; + hh->di = regs->di; + hh->bp = regs->bp; + hh->ax = regs->ax; + hh->ds = regs->ds; + hh->es = regs->es; + hh->orig_ax = regs->orig_ax; + hh->ip = regs->ip; + hh->cs = regs->cs; + hh->flags = regs->flags; + hh->sp = regs->sp; + hh->ss = regs->ss; + + /* for checkpoint in process context (from within a container) + the GS and FS registers should be saved from the hardware; + otherwise they are already sabed on the thread structure */ + if (t == current) { + savesegment(gs, hh->gs); + savesegment(fs, hh->fs); + } else { + hh->gs = thread->gs; + hh->fs = thread->fs; + } + + /* + * for checkpoint in process context (from within a container), + * the actual syscall is taking place at this very moment; so + * we (optimistically) subtitute the future return value (0) of + * this syscall into the orig_eax, so that upon restart it will + * succeed (or it will endlessly retry checkpoint...) + */ + if (t == current) { + BUG_ON(hh->orig_ax < 0); + hh->ax = 0; + } + + preempt_disable(); + + /* i387 + MMU + SSE logic */ + hh->used_math = tsk_used_math(t) ? 1 : 0; + if (hh->used_math) { + /* normally, no need to unlazy_fpu(), since TS_USEDFPU flag + * have been cleared when task was conexted-switched out... + * except if we are in process context, in which case we do */ + if (thread_info->status & TS_USEDFPU) + unlazy_fpu(current); + + hh->has_fxsr = cpu_has_fxsr; + memcpy(&hh->xstate, &thread->xstate, sizeof(thread->xstate)); + } + + /* debug regs */ + + /* + * for checkpoint in process context (from within a container), + * get the actual registers; otherwise get the saved values. + */ + if (t == current) { + get_debugreg(hh->debugreg0, 0); + get_debugreg(hh->debugreg1, 1); + get_debugreg(hh->debugreg2, 2); + get_debugreg(hh->debugreg3, 3); + get_debugreg(hh->debugreg6, 6); + get_debugreg(hh->debugreg7, 7); + } else { + hh->debugreg0 = thread->debugreg0; + hh->debugreg1 = thread->debugreg1; + hh->debugreg2 = thread->debugreg2; + hh->debugreg3 = thread->debugreg3; + hh->debugreg6 = thread->debugreg6; + hh->debugreg7 = thread->debugreg7; + } + + hh->uses_debug = !!(thread_info->flags & TIF_DEBUG); + + preempt_enable(); + + CR_PRINTK("math %d debug %d\n", hh->used_math, hh->uses_debug); + + return cr_write_obj(ctx, &h, hh); +} + +/* read the thread_struct into the current task */ +int cr_read_thread(struct cr_ctx *ctx) +{ + struct cr_hdr_thread *hh = cr_hbuf_get(ctx, sizeof(*hh)); + struct task_struct *t = current; + struct thread_struct *thread = &t->thread; + int ret; + + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_THREAD); + if (ret < 0) + return ret; + + CR_PRINTK("ntls %d\n", hh->ntls); + + if (hh->gdt_entry_tls_entries != GDT_ENTRY_TLS_ENTRIES || + hh->sizeof_tls_array != sizeof(thread->tls_array) || + hh->ntls < 0 || hh->ntls > GDT_ENTRY_TLS_ENTRIES) + return -EINVAL; + + if (hh->ntls > 0) { + + /* restore TLS by hand: why convert to struct user_desc if + * sys_set_thread_entry() will convert it back ? */ + + struct desc_struct *buf = ctx->tbuf; + int size = sizeof(*buf) * GDT_ENTRY_TLS_ENTRIES; + int cpu; + + BUG_ON(size > CR_TBUF_TOTAL); + + ret = cr_kread(ctx, buf, size); + if (ret < 0) + return ret; + + /* FIX: add sanity checks (eg. that values makes sense, that + * that we don't overwrite old values, etc */ + + cpu = get_cpu(); + memcpy(thread->tls_array, buf, size); + load_TLS(thread, cpu); + put_cpu(); + } + + return 0; +} + +/* read the cpu state nad registers for the current task */ +int cr_read_cpu(struct cr_ctx *ctx) +{ + struct cr_hdr_cpu *hh = cr_hbuf_get(ctx, sizeof(*hh)); + struct task_struct *t = current; + struct thread_struct *thread; + struct thread_info *thread_info; + struct pt_regs *regs; + int ret; + + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_CPU); + if (ret < 0) + return ret; + + /* FIX: sanity check for sensitive registers (eg. eflags) */ + + thread = &t->thread; + thread_info = task_thread_info(t); + regs = task_pt_regs(t); + + regs->bx = hh->bx; + regs->cx = hh->cx; + regs->dx = hh->dx; + regs->si = hh->si; + regs->di = hh->di; + regs->bp = hh->bp; + regs->ax = hh->ax; + regs->ds = hh->ds; + regs->es = hh->es; + regs->orig_ax = hh->orig_ax; + regs->ip = hh->ip; + regs->cs = hh->cs; + regs->flags = hh->flags; + regs->sp = hh->sp; + regs->ss = hh->ss; + + thread->gs = hh->gs; + thread->fs = hh->fs; + loadsegment(gs, hh->gs); + loadsegment(fs, hh->fs); + + CR_PRINTK("math %d debug %d\n", hh->used_math, hh->uses_debug); + + /* FIX: this should work ... (someone double check !) */ + + preempt_disable(); + + /* i387 + MMU + SSE */ + __clear_fpu(t); /* in case we used FPU in user mode */ + if (!hh->used_math) + clear_used_math(); + else { + if (hh->has_fxsr != cpu_has_fxsr) { + force_sig(SIGFPE, t); + return -EINVAL; + } + memcpy(&thread->xstate, &hh->xstate, sizeof(thread->xstate)); + set_used_math(); + } + + /* debug regs */ + if (hh->uses_debug) { + set_debugreg(hh->debugreg0, 0); + set_debugreg(hh->debugreg1, 1); + set_debugreg(hh->debugreg2, 2); + set_debugreg(hh->debugreg3, 3); + set_debugreg(hh->debugreg6, 6); + set_debugreg(hh->debugreg7, 7); + } + + preempt_enable(); + + return 0; +} diff -puN /dev/null include/asm-x86/ckpt.h --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/include/asm-x86/ckpt.h 2008-08-04 13:29:59.000000000 -0700 @@ -0,0 +1,46 @@ +#ifndef __ASM_X86_CKPT_H +#define __ASM_X86_CKPT_H + +#include <asm/processor.h> + +struct cr_hdr_thread { + /* NEED: restart blocks */ + __s16 gdt_entry_tls_entries; + __s16 sizeof_tls_array; + __s16 ntls; /* number of TLS entries to follow */ +}; + +struct cr_hdr_cpu { + __u64 bx; + __u64 cx; + __u64 dx; + __u64 si; + __u64 di; + __u64 bp; + __u64 ax; + __u64 ds; + __u64 es; + __u64 orig_ax; + __u64 ip; + __u64 cs; + __u64 flags; + __u64 sp; + __u64 ss; + __u64 fs; + __u64 gs; + + __u64 debugreg0; + __u64 debugreg1; + __u64 debugreg2; + __u64 debugreg3; + __u64 debugreg6; + __u64 debugreg7; + + __u8 uses_debug; + + __u8 used_math; + __u8 has_fxsr; + union thread_xstate xstate; /* i387 */ +}; + +#endif /* __ASM_X86_CKPT_H */ _ ^ permalink raw reply [flat|nested] 71+ messages in thread
* [RFC][PATCH 3/4] checkpoint/restart: memory management [not found] <20080807224033.FFB3A2C1@kernel> 2008-08-07 22:40 ` [RFC][PATCH 1/4] checkpoint-restart: general infrastructure Dave Hansen 2008-08-07 22:40 ` [RFC][PATCH 2/4] checkpoint/restart: x86 support Dave Hansen @ 2008-08-07 22:40 ` Dave Hansen 2008-08-07 22:40 ` [RFC][PATCH 4/4] introduce sys_checkpoint and sys_restore Dave Hansen ` (5 subsequent siblings) 8 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-07 22:40 UTC (permalink / raw) To: Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen For each vma, there is a 'struct cr_vma'; if the vma is file-mapped, it will be followed by the file name. The cr_vma->npages will tell how many pages were dumped for this vma. Then it will be followed by the actual data: first a dump of the addresses of all dumped pages (npages entries) followed by a dump of the contents of all dumped pages (npages pages). Then will come the next vma and so on. I guess I could also separate out the x86-specific bits here, but they're pretty small, comparatively. Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- linux-2.6.git-dave/arch/x86/kernel/ldt.c | 2 linux-2.6.git-dave/ckpt/Makefile | 2 linux-2.6.git-dave/ckpt/ckpt_arch.h | 2 linux-2.6.git-dave/ckpt/ckpt_hdr.h | 21 + linux-2.6.git-dave/ckpt/ckpt_mem.c | 388 ++++++++++++++++++++++++++++++ linux-2.6.git-dave/ckpt/ckpt_mem.h | 32 ++ linux-2.6.git-dave/ckpt/rstr_mem.c | 354 +++++++++++++++++++++++++++ linux-2.6.git-dave/ckpt/sys.c | 3 linux-2.6.git-dave/ckpt/x86.c | 83 ++++++ linux-2.6.git-dave/include/asm-x86/ckpt.h | 5 linux-2.6.git-dave/include/asm-x86/desc.h | 3 11 files changed, 892 insertions(+), 3 deletions(-) diff -puN arch/x86/kernel/ldt.c~memory_part arch/x86/kernel/ldt.c --- linux-2.6.git/arch/x86/kernel/ldt.c~memory_part 2008-08-05 08:37:29.000000000 -0700 +++ linux-2.6.git-dave/arch/x86/kernel/ldt.c 2008-08-05 08:38:00.000000000 -0700 @@ -183,7 +183,7 @@ static int read_default_ldt(void __user return bytecount; } -static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode) +int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode) { struct mm_struct *mm = current->mm; struct desc_struct ldt; diff -puN ckpt/ckpt_arch.h~memory_part ckpt/ckpt_arch.h --- linux-2.6.git/ckpt/ckpt_arch.h~memory_part 2008-08-05 08:37:29.000000000 -0700 +++ linux-2.6.git-dave/ckpt/ckpt_arch.h 2008-08-05 08:37:29.000000000 -0700 @@ -4,3 +4,5 @@ int cr_write_thread(struct cr_ctx *ctx, int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t); int cr_read_thread(struct cr_ctx *ctx); int cr_read_cpu(struct cr_ctx *ctx); +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm); +int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm); diff -puN ckpt/ckpt_hdr.h~memory_part ckpt/ckpt_hdr.h --- linux-2.6.git/ckpt/ckpt_hdr.h~memory_part 2008-08-05 08:37:29.000000000 -0700 +++ linux-2.6.git-dave/ckpt/ckpt_hdr.h 2008-08-05 08:37:29.000000000 -0700 @@ -67,3 +67,24 @@ struct cr_hdr_task { }; + +struct cr_hdr_mm { + __u32 tag; /* sharing identifier */ + __u64 start_code, end_code, start_data, end_data; + __u64 start_brk, brk, start_stack; + __u64 arg_start, arg_end, env_start, env_end; + __s16 map_count; +}; + +struct cr_hdr_vma { + __u32 how; + + __u64 vm_start; + __u64 vm_end; + __u64 vm_page_prot; + __u64 vm_flags; + __u64 vm_pgoff; + + __s16 npages; + __s16 namelen; +}; diff -puN /dev/null ckpt/ckpt_mem.c --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/ckpt/ckpt_mem.c 2008-08-05 08:37:29.000000000 -0700 @@ -0,0 +1,388 @@ +/* + * Checkpoint memory contents + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#include <linux/sched.h> +#include <linux/slab.h> +#include <linux/file.h> +#include <linux/pagemap.h> +#include <linux/mm_types.h> + +#include "ckpt.h" +#include "ckpt_hdr.h" +#include "ckpt_arch.h" +#include "ckpt_mem.h" + +/* + * utilities to alloc, free, and handle 'struct cr_pgarr' + * (common to ckpt_mem.c and rstr_mem.c) + */ + +#define CR_ORDER_PGARR 0 +#define CR_PGARR_TOTAL ((PAGE_SIZE << CR_ORDER_PGARR) / sizeof(void *)) + +/* release pages referenced by a page-array */ +void _cr_pgarr_release(struct cr_ctx *ctx, struct cr_pgarr *pgarr) +{ + int n; + + /* only checkpoint keeps references to pages */ + if (ctx->flags & CR_CTX_CKPT) { + CR_PRINTK("release pages (nused %d)\n", pgarr->nused); + for (n = pgarr->nused; n--; ) + page_cache_release(pgarr->pages[n]); + } + pgarr->nused = 0; + pgarr->nleft = CR_PGARR_TOTAL; +} + +/* release pages referenced by chain of page-arrays */ +void cr_pgarr_release(struct cr_ctx *ctx) +{ + struct cr_pgarr *pgarr; + + for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) + _cr_pgarr_release(ctx, pgarr); +} + +/* free a chain of page-arrays */ +void cr_pgarr_free(struct cr_ctx *ctx) +{ + struct cr_pgarr *pgarr, *pgnxt; + + for (pgarr = ctx->pgarr; pgarr; pgarr = pgnxt) { + _cr_pgarr_release(ctx, pgarr); + free_pages((unsigned long) ctx->pgarr->addrs, CR_ORDER_PGARR); + free_pages((unsigned long) ctx->pgarr->pages, CR_ORDER_PGARR); + pgnxt = pgarr->next; + kfree(pgarr); + } +} + +/* allocate and add a new page-array to chain */ +struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx, struct cr_pgarr **pgnew) +{ + struct cr_pgarr *pgarr = ctx->pgcur; + + if (pgarr && pgarr->next) { + ctx->pgcur = pgarr->next; + return pgarr->next; + } + + if ((pgarr = kzalloc(sizeof(*pgarr), GFP_KERNEL))) { + pgarr->nused = 0; + pgarr->nleft = CR_PGARR_TOTAL; + pgarr->addrs = (unsigned long *) + __get_free_pages(GFP_KERNEL, CR_ORDER_PGARR); + pgarr->pages = (struct page **) + __get_free_pages(GFP_KERNEL, CR_ORDER_PGARR); + if (likely(pgarr->addrs && pgarr->pages)) { + *pgnew = pgarr; + ctx->pgcur = pgarr; + return pgarr; + } else if (pgarr->addrs) + free_pages((unsigned long) pgarr->addrs, + CR_ORDER_PGARR); + kfree(pgarr); + } + + return NULL; +} + +/* return current page-array (and allocate if needed) */ +struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx) +{ + struct cr_pgarr *pgarr = ctx->pgcur; + + if (unlikely(!pgarr->nleft)) + pgarr = cr_pgarr_alloc(ctx, &pgarr->next); + return pgarr; +} + +/* + * Checkpoint is outside the context of the checkpointee, so one cannot + * simply read pages from user-space. Instead, we scan the address space + * of the target to cherry-pick pages of interest. Selected pages are + * enlisted in a page-array chain (attached to the checkpoint context). + * To save their contents, each page is mapped to kernel memory and then + * dumped to the file descriptor. + */ + +/** + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma + * @ctx - checkpoint context + * @pgarr - page-array to fill + * @vma - vma to scan + * @start - start address (updated) + */ +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr, + struct vm_area_struct *vma, unsigned long *start) +{ + unsigned long end = vma->vm_end; + unsigned long addr = *start; + struct page **pagep; + unsigned long *addrp; + int cow, nr, ret = 0; + + nr = pgarr->nleft; + pagep = &pgarr->pages[pgarr->nused]; + addrp = &pgarr->addrs[pgarr->nused]; + cow = !!vma->vm_file; + + while (addr < end) { + struct page *page; + + /* simplified version of get_user_pages(): already have vma, + * only need FOLL_TOUCH, and (for now) ignore fault stats */ + + cond_resched(); + while (!(page = follow_page(vma, addr, FOLL_TOUCH))) { + ret = handle_mm_fault(vma->vm_mm, vma, addr, 0); + if (ret & VM_FAULT_ERROR) { + if (ret & VM_FAULT_OOM) + ret = -ENOMEM; + else if (ret & VM_FAULT_SIGBUS) + ret = -EFAULT; + else + BUG(); + break; + } + cond_resched(); + } + + if (IS_ERR(page)) { + ret = PTR_ERR(page); + break; + } + + if (page == ZERO_PAGE(0)) + page = NULL; /* zero page: ignore */ + else if (cow && page_mapping(page) != NULL) + page = NULL; /* clean cow: ignore */ + else { + get_page(page); + *(addrp++) = addr; + *(pagep++) = page; + if (--nr == 0) { + addr += PAGE_SIZE; + break; + } + } + + addr += PAGE_SIZE; + } + + if (unlikely(ret < 0)) { + nr = pgarr->nleft - nr; + while (nr--) + page_cache_release(*(--pagep)); + return ret; + } + + *start = addr; + return (pgarr->nleft - nr); +} + +/** + * cr_vma_scan_pages - scan vma for pages that will need to be dumped + * @ctx - checkpoint context + * @vma - vma to scan + * + * a list of addr/page tuples is kept in ctx->pgarr page-array chain + */ +static int cr_vma_scan_pages(struct cr_ctx *ctx, struct vm_area_struct *vma) +{ + unsigned long addr = vma->vm_start; + unsigned long end = vma->vm_end; + struct cr_pgarr *pgarr; + int nr, total = 0; + + while (addr < end) { + if (!(pgarr = cr_pgarr_prep(ctx))) + return -ENOMEM; + if ((nr = cr_vma_fill_pgarr(ctx, pgarr, vma, &addr)) < 0) + return nr; + pgarr->nleft -= nr; + pgarr->nused += nr; + total += nr; + } + + CR_PRINTK("total %d\n", total); + return total; +} + +/** + * cr_vma_dump_pages - dump pages listed in the ctx page-array chain + * @ctx - checkpoint context + * @total - total number of pages + */ +static int cr_vma_dump_pages(struct cr_ctx *ctx, int total) +{ + struct cr_pgarr *pgarr; + int ret; + + if (!total) + return 0; + + for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) { + ret = cr_kwrite(ctx, pgarr->addrs, + pgarr->nused * sizeof(*pgarr->addrs)); + if (ret < 0) + return ret; + } + + for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next) { + struct page **pages = pgarr->pages; + int nr = pgarr->nused; + void *ptr; + + while (nr--) { + ptr = kmap(*pages); + ret = cr_kwrite(ctx, ptr, PAGE_SIZE); + kunmap(*pages); + if (ret < 0) + return ret; + pages++; + } + } + + return total; +} + +static int cr_write_vma(struct cr_ctx *ctx, struct vm_area_struct *vma) +{ + struct cr_hdr h; + struct cr_hdr_vma *hh = ctx->tbuf; + char *fname = NULL; + int how, nr, ret; + + h.type = CR_HDR_VMA; + h.len = sizeof(*hh); + h.id = ctx->pid; + + hh->vm_start = vma->vm_start; + hh->vm_end = vma->vm_end; + hh->vm_page_prot = vma->vm_page_prot.pgprot; + hh->vm_flags = vma->vm_flags; + hh->vm_pgoff = vma->vm_pgoff; + + if (vma->vm_flags & (VM_SHARED | VM_IO | VM_HUGETLB | VM_NONLINEAR)) { + printk(KERN_WARNING "CR: unknown VMA %#lx\n", vma->vm_flags); + return -ETXTBSY; + } + + /* by default assume anon memory */ + how = CR_VMA_ANON; + + /* if there is a backing file, assume private-mapped */ + /* (NEED: check if the file is unlinked) */ + if (vma->vm_file) { + nr = PAGE_SIZE; + fname = cr_get_fname(&vma->vm_file->f_path, + ctx->vfsroot, ctx->tbuf, &nr); + if (IS_ERR(fname)) + return PTR_ERR(fname); + hh->namelen = nr; + how = CR_VMA_FILE; + } else + hh->namelen = 0; + + hh->how = how; + + /* + * it seems redundant now, but we do it in 3 steps for because: + * first, the logic is simpler when we how many pages before + * dumping them; second, a future optimization will defer the + * writeout (dump, and free) to a later step; in which case all + * the pages to be dumped will be aggregated on the checkpoint ctx + */ + + /* (1) scan: scan through the PTEs of the vma, both to count the + * pages to dump, and make those pages COW. keep the list of pages + * (and a reference to each page) on the checkpoint ctx */ + nr = cr_vma_scan_pages(ctx, vma); + if (nr < 0) { + cr_put_fname(ctx->tbuf, fname, PAGE_SIZE); + return nr; + } + + hh->npages = nr; + ret = cr_write_obj(ctx, &h, hh); + + if (!ret && hh->namelen) + ret = cr_write_str(ctx, fname, hh->namelen); + + cr_put_fname(ctx->tbuf, fname, PAGE_SIZE); + + if (ret < 0) + return ret; + + /* (2) dump: write out the addresses of all pages in the list (on + * the checkpoint ctx) followed by the contents of all pages */ + ret = cr_vma_dump_pages(ctx, nr); + + /* (3) free: free the extra references to the pages in the list */ + cr_pgarr_release(ctx); + + return ret; +} + +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t) +{ + struct cr_hdr h; + struct cr_hdr_mm *hh = ctx->tbuf; + struct mm_struct *mm; + struct vm_area_struct *vma; + int ret; + + h.type = CR_HDR_MM; + h.len = sizeof(*hh); + h.id = ctx->pid; + + mm = get_task_mm(t); + + hh->tag = 1; /* non-zero will mean first time encounter */ + + hh->start_code = mm->start_code; + hh->end_code = mm->end_code; + hh->start_data = mm->start_data; + hh->end_data = mm->end_data; + hh->start_brk = mm->start_brk; + hh->brk = mm->brk; + hh->start_stack = mm->start_stack; + hh->arg_start = mm->arg_start; + hh->arg_end = mm->arg_end; + hh->env_start = mm->env_start; + hh->env_end = mm->env_end; + + hh->map_count = mm->map_count; + + /* FIX: need also mm->flags */ + + ret = cr_write_obj(ctx, &h, hh); + if (ret < 0) + goto out; + + /* write the vma's */ + down_read(&mm->mmap_sem); + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if ((ret = cr_write_vma(ctx, vma)) < 0) + break; + } + up_read(&mm->mmap_sem); + + if (ret < 0) + goto out; + + ret = cr_write_mm_context(ctx, mm); + + out: + mmput(mm); + return ret; +} diff -puN /dev/null ckpt/ckpt_mem.h --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/ckpt/ckpt_mem.h 2008-08-05 08:37:29.000000000 -0700 @@ -0,0 +1,32 @@ +/* + * Generic container checkpoint-restart + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#include <linux/mm_types.h> + +/* page-array chains: each pgarr hols a list of <addr,page> tuples */ +struct cr_pgarr { + unsigned long *addrs; + struct page **pages; + struct cr_pgarr *next; + unsigned short nleft; + unsigned short nused; +}; + +/* vma subtypes */ +enum { + CR_VMA_ANON = 1, + CR_VMA_FILE +}; + +extern void _cr_pgarr_release(struct cr_ctx *ctx, struct cr_pgarr *pgarr); +extern void cr_pgarr_release(struct cr_ctx *ctx); +extern void cr_pgarr_free(struct cr_ctx *ctx); +extern struct cr_pgarr *cr_pgarr_alloc(struct cr_ctx *ctx, struct cr_pgarr **pgnew); +extern struct cr_pgarr *cr_pgarr_prep(struct cr_ctx *ctx); diff -puN ckpt/Makefile~memory_part ckpt/Makefile --- linux-2.6.git/ckpt/Makefile~memory_part 2008-08-05 08:37:29.000000000 -0700 +++ linux-2.6.git-dave/ckpt/Makefile 2008-08-05 08:37:29.000000000 -0700 @@ -1,2 +1,2 @@ -obj-y += sys.o checkpoint.o restart.o +obj-y += sys.o checkpoint.o restart.o ckpt_mem.o rstr_mem.o obj-$(CONFIG_X86) += x86.o diff -puN /dev/null ckpt/rstr_mem.c --- /dev/null 2007-04-11 11:48:27.000000000 -0700 +++ linux-2.6.git-dave/ckpt/rstr_mem.c 2008-08-05 08:37:29.000000000 -0700 @@ -0,0 +1,354 @@ +/* + * Restart memory contents + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. + */ + +#include <asm/unistd.h> + +#include <linux/sched.h> +#include <linux/fcntl.h> +#include <linux/file.h> +#include <linux/fs.h> +#include <linux/uaccess.h> +#include <linux/mm_types.h> +#include <linux/mman.h> +#include <linux/mm.h> +#include <linux/err.h> +#include <asm/cacheflush.h> + +#include "ckpt.h" +#include "ckpt_arch.h" +#include "ckpt_hdr.h" +#include "ckpt_mem.h" + +/* + * Unlike checkpoint, restart is executed in the context of each restarting + * process: vma regions are restored via a call to mmap(), and the data is + * read in directly to the address space of the current process + */ + +/** + * cr_vma_read_pages_addr - read addresses of pages to page-array chain + * @ctx - restart context + * @npages - number of pages + */ +static int cr_vma_read_pages_addr(struct cr_ctx *ctx, int npages) +{ + struct cr_pgarr *pgarr; + int nr, ret; + + while (npages) { + if (!(pgarr = cr_pgarr_prep(ctx))) + return -ENOMEM; + nr = min(npages, (int) pgarr->nleft); + ret = cr_kread(ctx, pgarr->addrs, nr * sizeof(unsigned long)); + if (ret < 0) + return ret; + pgarr->nleft -= nr; + pgarr->nused += nr; + npages -= nr; + } + return 0; +} + +/** + * cr_vma_read_pages_data - read in data of pages in page-array chain + * @ctx - restart context + * @npages - number of pages + */ +static int cr_vma_read_pages_data(struct cr_ctx *ctx, int npages) +{ + struct cr_pgarr *pgarr; + unsigned long *addrs; + int nr, ret; + + for (pgarr = ctx->pgarr; npages; pgarr = pgarr->next) { + addrs = pgarr->addrs; + nr = pgarr->nused; + npages -= nr; + while (nr--) { + ret = cr_uread(ctx, (void *) *(addrs++), PAGE_SIZE); + if (ret < 0) + return ret; + } + } + + return 0; +} + +/* change the protection of an address range to be writable/non-writable. + * this is useful when restoring the memory of a read-only vma */ +static int cr_vma_writable(struct mm_struct *mm, unsigned long start, + unsigned long end, int writable) +{ + struct vm_area_struct *vma, *prev; + unsigned long flags = 0; + int ret = -EINVAL; + + CR_PRINTK("vma %#lx-%#lx writable %d\n", start, end, writable); + + down_write(&mm->mmap_sem); + vma = find_vma_prev(mm, start, &prev); + if (unlikely(!vma || vma->vm_start > end || vma->vm_end < start)) + goto out; + if (writable && !(vma->vm_flags & VM_WRITE)) + flags = vma->vm_flags | VM_WRITE; + else if (!writable && (vma->vm_flags & VM_WRITE)) + flags = vma->vm_flags & ~VM_WRITE; + CR_PRINTK("flags %#lx\n", flags); + if (flags) + ret = mprotect_fixup(vma, &prev, vma->vm_start, + vma->vm_end, flags); + out: + up_write(&mm->mmap_sem); + return ret; +} + +/** + * cr_vma_read_pages - read in pages for to restore a vma + * @ctx - restart context + * @cr_vma - vma descriptor from restart + */ +static int cr_vma_read_pages(struct cr_ctx *ctx, struct cr_hdr_vma *cr_vma) +{ + struct mm_struct *mm = current->mm; + int ret = 0; + + if (!cr_vma->npages) + return 0; + + /* in the unlikely case that this vma is read-only */ + if (!(cr_vma->vm_flags & VM_WRITE)) + ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 1); + + if (!ret) + ret = cr_vma_read_pages_addr(ctx, cr_vma->npages); + if (!ret) + ret = cr_vma_read_pages_data(ctx, cr_vma->npages); + if (ret < 0) + return ret; + + cr_pgarr_release(ctx); /* reset page-array chain */ + + /* restore original protection for this vma */ + if (!(cr_vma->vm_flags & VM_WRITE)) + ret = cr_vma_writable(mm, cr_vma->vm_start, cr_vma->vm_end, 0); + + return ret; +} + +/** + * cr_calc_map_prot_bits - convert vm_flags to mmap protection + * orig_vm_flags: source vm_flags + */ +static unsigned long cr_calc_map_prot_bits(unsigned long orig_vm_flags) +{ + unsigned long vm_prot = 0; + + if (orig_vm_flags & VM_READ) + vm_prot |= PROT_READ; + if (orig_vm_flags & VM_WRITE) + vm_prot |= PROT_WRITE; + if (orig_vm_flags & VM_EXEC) + vm_prot |= PROT_EXEC; + if (orig_vm_flags & PROT_SEM) /* only (?) with IPC-SHM */ + vm_prot |= PROT_SEM; + + return vm_prot; +} + +/** + * cr_calc_map_flags_bits - convert vm_flags to mmap flags + * orig_vm_flags: source vm_flags + */ +static unsigned long cr_calc_map_flags_bits(unsigned long orig_vm_flags) +{ + unsigned long vm_flags = 0; + + vm_flags = MAP_FIXED; + if (orig_vm_flags & VM_GROWSDOWN) + vm_flags |= MAP_GROWSDOWN; + if (orig_vm_flags & VM_DENYWRITE) + vm_flags |= MAP_DENYWRITE; + if (orig_vm_flags & VM_EXECUTABLE) + vm_flags |= MAP_EXECUTABLE; + if (orig_vm_flags & VM_MAYSHARE) + vm_flags |= MAP_SHARED; + else + vm_flags |= MAP_PRIVATE; + + return vm_flags; +} + +static int cr_read_vma(struct cr_ctx *ctx, struct mm_struct *mm) +{ + struct cr_hdr_vma *hh = cr_hbuf_get(ctx, sizeof(*hh)); + unsigned long vm_size, vm_flags, vm_prot, vm_pgoff; + unsigned long addr; + unsigned long flags; + struct file *file = NULL; + char *fname = NULL; + int ret; + + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_VMA); + if (ret < 0) + return ret; + + CR_PRINTK("vma %#lx-%#lx npages %d namelen %d\n", + (unsigned long) hh->vm_start, (unsigned long) hh->vm_end, + (int) hh->npages, (int) hh->namelen); + + if (hh->vm_end < hh->vm_start) + return -EINVAL; + if (hh->npages < 0 || hh->namelen < 0) + return -EINVAL; + + vm_size = hh->vm_end - hh->vm_start; + vm_prot = cr_calc_map_prot_bits(hh->vm_flags); + vm_flags = cr_calc_map_flags_bits(hh->vm_flags); + vm_pgoff = hh->vm_pgoff; + + if (hh->namelen) { + fname = ctx->tbuf; + ret = cr_read_str(ctx, fname, PAGE_SIZE); + if (ret < 0) + return ret; + } + + CR_PRINTK("vma fname '%s' how %d\n", fname, hh->how); + + switch (hh->how) { + + case CR_VMA_ANON: /* anonymous private mapping */ + if (hh->namelen) + return -EINVAL; + /* vm_pgoff for anonymous mapping is the "global" page + offset (namely from addr 0x0), so we force a zero */ + vm_pgoff = 0; + break; + + case CR_VMA_FILE: /* private mapping from a file */ + if (!hh->namelen) + return -EINVAL; + /* O_RDWR only needed if both (VM_WRITE|VM_SHARED) are set */ + flags = hh->vm_flags & (VM_WRITE | VM_SHARED); + flags = (flags == (VM_WRITE | VM_SHARED) ? O_RDWR : O_RDONLY); + file = filp_open(fname, flags, 0); + if (IS_ERR(file)) + return PTR_ERR(file); + break; + + default: + return -EINVAL; + + } + + addr = do_mmap_pgoff(file, (unsigned long) hh->vm_start, + vm_size, vm_prot, vm_flags, vm_pgoff); + CR_PRINTK("vma size %#lx prot %#lx flags %#lx pgoff %#lx => %#lx\n", + vm_size, vm_prot, vm_flags, vm_pgoff, addr); + + /* the file (if opened) is now referenced by the vma */ + if (file) + filp_close(file, NULL); + + if (IS_ERR((void*) addr)) + return (PTR_ERR((void *) addr)); + + /* + * CR_VMA_ANON: read in memory as is + * CR_VMA_FILE: read in memory as is + * (more to follow ...) + */ + + switch (hh->how) { + case CR_VMA_ANON: + case CR_VMA_FILE: + /* standard case: read the data into the memory */ + ret = cr_vma_read_pages(ctx, hh); + break; + } + + if (ret < 0) + return ret; + + if (vm_prot & PROT_EXEC) + flush_icache_range(hh->vm_start, hh->vm_end); + + cr_hbuf_put(ctx, sizeof(*hh)); + CR_PRINTK("vma retval %d\n", ret); + return 0; +} + +static int cr_destroy_mm(struct mm_struct *mm) +{ + struct vm_area_struct *vmnext = mm->mmap; + struct vm_area_struct *vma; + int ret; + + while (vmnext) { + vma = vmnext; + vmnext = vmnext->vm_next; + ret = do_munmap(mm, vma->vm_start, vma->vm_end-vma->vm_start); + if (ret < 0) + return ret; + } + return 0; +} + +int cr_read_mm(struct cr_ctx *ctx) +{ + struct cr_hdr_mm *hh = cr_hbuf_get(ctx, sizeof(*hh)); + struct mm_struct *mm; + int nr, ret; + + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM); + if (ret < 0) + return ret; + + CR_PRINTK("map_count %d\n", hh->map_count); + + /* XXX need more sanity checks */ + if (hh->start_code > hh->end_code || + hh->start_data > hh->end_data || hh->map_count < 0) + return -EINVAL; + + mm = current->mm; + + /* point of no return -- destruct current mm */ + down_write(&mm->mmap_sem); + ret = cr_destroy_mm(mm); + up_write(&mm->mmap_sem); + + if (ret < 0) + return ret; + + mm->start_code = hh->start_code; + mm->end_code = hh->end_code; + mm->start_data = hh->start_data; + mm->end_data = hh->end_data; + mm->start_brk = hh->start_brk; + mm->brk = hh->brk; + mm->start_stack = hh->start_stack; + mm->arg_start = hh->arg_start; + mm->arg_end = hh->arg_end; + mm->env_start = hh->env_start; + mm->env_end = hh->env_end; + + /* FIX: need also mm->flags */ + + for (nr = hh->map_count; nr; nr--) { + ret = cr_read_vma(ctx, mm); + if (ret < 0) + return ret; + } + + cr_hbuf_put(ctx, sizeof(*hh)); + + return cr_read_mm_context(ctx, mm); +} diff -puN ckpt/sys.c~memory_part ckpt/sys.c --- linux-2.6.git/ckpt/sys.c~memory_part 2008-08-05 08:37:29.000000000 -0700 +++ linux-2.6.git-dave/ckpt/sys.c 2008-08-05 08:37:29.000000000 -0700 @@ -15,6 +15,7 @@ #include <linux/capability.h> #include "ckpt.h" +#include "ckpt_mem.h" /* * helpers to write/read to/from the image file descriptor @@ -118,6 +119,8 @@ void cr_ctx_free(struct cr_ctx *ctx) if (ctx->vfsroot) path_put(ctx->vfsroot); + cr_pgarr_free(ctx); + free_pages((unsigned long) ctx->tbuf, CR_ORDER_TBUF); free_pages((unsigned long) ctx->hbuf, CR_ORDER_HBUF); diff -puN ckpt/x86.c~memory_part ckpt/x86.c --- linux-2.6.git/ckpt/x86.c~memory_part 2008-08-05 08:37:29.000000000 -0700 +++ linux-2.6.git-dave/ckpt/x86.c 2008-08-05 08:37:29.000000000 -0700 @@ -1,5 +1,6 @@ #include <asm/ckpt.h> #include <asm/desc.h> +#include <asm/ldt.h> #include <asm/i387.h> #include "ckpt.h" @@ -267,3 +268,85 @@ int cr_read_cpu(struct cr_ctx *ctx) return 0; } + +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm) +{ + struct cr_hdr h; + struct cr_hdr_mm_context *hh = ctx->tbuf; + int ret; + + h.type = CR_HDR_MM_CONTEXT; + h.len = sizeof(*hh); + h.id = ctx->pid; + + mutex_lock(&mm->context.lock); + + hh->ldt_entry_size = LDT_ENTRY_SIZE; + hh->nldt = mm->context.size; + + CR_PRINTK("nldt %d\n", hh->nldt); + + ret = cr_write_obj(ctx, &h, hh); + if (ret < 0) + return ret; + + ret = cr_kwrite(ctx, mm->context.ldt, hh->nldt * LDT_ENTRY_SIZE); + + mutex_unlock(&mm->context.lock); + + return ret; +} + +int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm) +{ + struct cr_hdr_mm_context *hh = cr_hbuf_get(ctx, sizeof(*hh)); + int n, ret; + + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_MM_CONTEXT); + if (ret < 0) + return ret; + + CR_PRINTK("nldt %d\n", hh->nldt); + + if (hh->nldt < 0 || hh->ldt_entry_size != LDT_ENTRY_SIZE) + return -EINVAL; + + /* to utilize the syscall modify_ldt() we first convert the data + * in the checkpoint image from 'struct desc_struct' to 'struct + * user_desc' with reverse logic of inclue/asm/desc.h:fill_ldt() */ + + for (n = 0; n < hh->nldt; n++) { + struct user_desc info; + struct desc_struct desc; + mm_segment_t old_fs; + + ret = cr_kread(ctx, &desc, LDT_ENTRY_SIZE); + if (ret < 0) + return ret; + + info.entry_number = n; + info.base_addr = desc.base0 | (desc.base1 << 16); + info.limit = desc.limit0; + info.seg_32bit = desc.d; + info.contents = desc.type >> 2; + info.read_exec_only = (desc.type >> 1) ^ 1; + info.limit_in_pages = desc.g; + info.seg_not_present = desc.p ^ 1; + info.useable = desc.avl; + + old_fs = get_fs(); + set_fs(get_ds()); + /* ret = sys_modify_ldt(1, &info, sizeof(info)); */ + /* modified by daveh */ + ret = write_ldt(&info, sizeof(info), 1); + set_fs(old_fs); + + if (ret < 0) + return ret; + } + + load_LDT(&mm->context); + + cr_hbuf_put(ctx, sizeof(*hh)); + return 0; +} diff -puN include/asm-x86/ckpt.h~memory_part include/asm-x86/ckpt.h --- linux-2.6.git/include/asm-x86/ckpt.h~memory_part 2008-08-05 08:37:29.000000000 -0700 +++ linux-2.6.git-dave/include/asm-x86/ckpt.h 2008-08-05 08:37:29.000000000 -0700 @@ -43,4 +43,9 @@ struct cr_hdr_cpu { union thread_xstate xstate; /* i387 */ }; +struct cr_hdr_mm_context { + __s16 ldt_entry_size; + __s16 nldt; +}; + #endif /* __ASM_X86_CKPT_H */ diff -puN include/asm-x86/desc.h~memory_part include/asm-x86/desc.h --- linux-2.6.git/include/asm-x86/desc.h~memory_part 2008-08-05 08:37:29.000000000 -0700 +++ linux-2.6.git-dave/include/asm-x86/desc.h 2008-08-05 08:40:11.000000000 -0700 @@ -111,6 +111,8 @@ static inline void native_write_ldt_entr memcpy(&ldt[entry], desc, 8); } +int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode); + static inline void native_write_gdt_entry(struct desc_struct *gdt, int entry, const void *desc, int type) { @@ -394,7 +396,6 @@ static inline void set_system_gate_ist(i shll $16, base; \ movw idx * 8 + 2(gdt), lo_w; - #endif /* __ASSEMBLY__ */ #endif _ ^ permalink raw reply [flat|nested] 71+ messages in thread
* [RFC][PATCH 4/4] introduce sys_checkpoint and sys_restore [not found] <20080807224033.FFB3A2C1@kernel> ` (2 preceding siblings ...) 2008-08-07 22:40 ` [RFC][PATCH 3/4] checkpoint/restart: memory management Dave Hansen @ 2008-08-07 22:40 ` Dave Hansen 2008-08-08 9:25 ` [RFC][PATCH 0/4] kernel-based checkpoint restart Arnd Bergmann ` (4 subsequent siblings) 8 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-07 22:40 UTC (permalink / raw) To: Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> Create trivial sys_checkpoint and sys_restore system calls. They will enable to checkpoint and restart an entire container, to and from a checkpoint image file. First create a template for both syscalls: they take a file descriptor (for the image file) and flags as arguments. For sys_checkpoint the first argument identifies the target container; for sys_restart it will identify the checkpoint image. Signed-off-by: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> --- linux-2.6.git-dave/arch/x86/kernel/syscall_table_32.S | 2 ++ linux-2.6.git-dave/include/asm-x86/unistd_32.h | 2 ++ 2 files changed, 4 insertions(+) diff -puN arch/x86/kernel/syscall_table_32.S~introduce_sys_checkpoint_and_sys_restore arch/x86/kernel/syscall_table_32.S --- linux-2.6.git/arch/x86/kernel/syscall_table_32.S~introduce_sys_checkpoint_and_sys_restore 2008-08-07 15:38:04.000000000 -0700 +++ linux-2.6.git-dave/arch/x86/kernel/syscall_table_32.S 2008-08-07 15:38:04.000000000 -0700 @@ -326,3 +326,5 @@ ENTRY(sys_call_table) .long sys_fallocate .long sys_timerfd_settime /* 325 */ .long sys_timerfd_gettime + .long sys_checkpoint + .long sys_restart diff -puN include/asm-x86/unistd_32.h~introduce_sys_checkpoint_and_sys_restore include/asm-x86/unistd_32.h --- linux-2.6.git/include/asm-x86/unistd_32.h~introduce_sys_checkpoint_and_sys_restore 2008-08-07 15:38:04.000000000 -0700 +++ linux-2.6.git-dave/include/asm-x86/unistd_32.h 2008-08-07 15:38:04.000000000 -0700 @@ -332,6 +332,8 @@ #define __NR_fallocate 324 #define __NR_timerfd_settime 325 #define __NR_timerfd_gettime 326 +#define __NR_checkpoint 327 +#define __NR_restart 328 #ifdef __KERNEL__ diff -puN Makefile~introduce_sys_checkpoint_and_sys_restore Makefile _ ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 0/4] kernel-based checkpoint restart [not found] <20080807224033.FFB3A2C1@kernel> ` (3 preceding siblings ...) 2008-08-07 22:40 ` [RFC][PATCH 4/4] introduce sys_checkpoint and sys_restore Dave Hansen @ 2008-08-08 9:25 ` Arnd Bergmann [not found] ` <1218218784.19082.10.camel@nimitz> [not found] ` <200808081125.12706.arnd-r2nGTMty4D4@public.gmane.org> [not found] ` <20080807224035.E56663BF@kernel> ` (3 subsequent siblings) 8 siblings, 2 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-08 9:25 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Friday 08 August 2008, Dave Hansen wrote: > These patches are from Oren Laaden. I've refactored them > a bit to make them a wee bit more reviewable. I think this > separates out the per-arch bits pretty well. It should also > be at least build-bisetable. Cool stuff > ============================== ckpt.c ================================ > > #define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */ > > #include <stdio.h> > #include <stdlib.h> > #include <errno.h> > #include <fcntl.h> > #include <unistd.h> > #include <asm/unistd_32.h> > #include <sys/syscall.h> Note that asm/unistd_32.h is not portable, you should use asm/unistd.h in the example. > pid_t pid = getpid(); > int ret; > > ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0); Interface-wise, I would consider checkpointing yourself signficantly different from checkpointing some other thread. If checkpointing yourself is the common case, it probably makes sense to allow passing of pid=0 for this. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1218218784.19082.10.camel@nimitz>]
* Re: [RFC][PATCH 0/4] kernel-based checkpoint restart [not found] ` <1218218784.19082.10.camel@nimitz> @ 2008-08-08 18:18 ` Arnd Bergmann 0 siblings, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-08 18:18 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Friday 08 August 2008, Dave Hansen wrote: > I don't think it is the common case. Probably now when we're screwing > around with it, but not in the future. Do you think it is worth adding > the pid=0 handling? If it's the exception, probably not. Otherwise it would be a nice shortcut to avoid having to do two system calls every time you write code using it. Then again, there are probably not many programs calling it anyway, if it get encapsulated in some user space tool. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <200808081125.12706.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: [RFC][PATCH 0/4] kernel-based checkpoint restart [not found] ` <200808081125.12706.arnd-r2nGTMty4D4@public.gmane.org> @ 2008-08-08 18:06 ` Dave Hansen 2008-08-08 19:44 ` Oren Laadan 1 sibling, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-08 18:06 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Fri, 2008-08-08 at 11:25 +0200, Arnd Bergmann wrote: > > pid_t pid = getpid(); > > int ret; > > > > ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0); > > Interface-wise, I would consider checkpointing yourself signficantly > different from checkpointing some other thread. If checkpointing > yourself is the common case, it probably makes sense to allow passing > of pid=0 for this. I don't think it is the common case. Probably now when we're screwing around with it, but not in the future. Do you think it is worth adding the pid=0 handling? -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 0/4] kernel-based checkpoint restart [not found] ` <200808081125.12706.arnd-r2nGTMty4D4@public.gmane.org> 2008-08-08 18:06 ` Dave Hansen @ 2008-08-08 19:44 ` Oren Laadan 1 sibling, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-08 19:44 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen Hi, Arnd Bergmann wrote: > On Friday 08 August 2008, Dave Hansen wrote: >> These patches are from Oren Laaden. I've refactored them >> a bit to make them a wee bit more reviewable. I think this >> separates out the per-arch bits pretty well. It should also >> be at least build-bisetable. > > Cool stuff > Thanks. This is a proof of concept so all sorts of feedback are definitely welcome. Some of the ideas and discussions are found around: http://wiki.openvz.org/Containers/Mini-summit_2008 and the notes: http://wiki.openvz.org/Containers/Mini-summit_2008_notes and the archives of the linux containers mailing list: https://lists.linux-foundation.org/pipermail/containers/ (August and July). Several aspects of the implementation are still experimental and I expect them to evolve with the feedback. In particular, expect the specific user interface (syscalls) and the checkpoint image format to be moving targets. >> ============================== ckpt.c ================================ >> >> #define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */ >> >> #include <stdio.h> >> #include <stdlib.h> >> #include <errno.h> >> #include <fcntl.h> >> #include <unistd.h> >> #include <asm/unistd_32.h> >> #include <sys/syscall.h> > > Note that asm/unistd_32.h is not portable, you should use asm/unistd.h > in the example. > >> pid_t pid = getpid(); >> int ret; >> >> ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0); > > Interface-wise, I would consider checkpointing yourself signficantly > different from checkpointing some other thread. If checkpointing > yourself is the common case, it probably makes sense to allow passing > of pid=0 for this. > The checkpoint/restart code is meant to checkpoint a whole container, that is be able to save the state of multiple other tasks. The same code can also be used to checkpoint yourself fairly easily with minimal changes (see comments in the code about "in context" checkpoint/restart that take care of this). I suggest to keep the interface as is in the sense that the pid will identify the target container (e.g. the pid of the init process of that container). Then, pid=0 would mean "the container to which I belong" if you are inside a container (and therefore don't know the pid of the init process there). Finally, to checkpoint yourself, you would set the a bit in the flags argument to something like CR_CKPT_MYSELF. Such a flag will be needed internally anyway to special-case self checkpoint where appropriate. Comments are welcome. Oren. ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <20080807224035.E56663BF@kernel>]
* Re: [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] ` <20080807224035.E56663BF@kernel> @ 2008-08-08 12:09 ` Arnd Bergmann [not found] ` <200808081409.30591.arnd@arndb.de> 1 sibling, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-08 12:09 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA > diff -puN /dev/null include/asm-x86/ckpt.h > --- /dev/null 2007-04-11 11:48:27.000000000 -0700 > +++ linux-2.6.git-dave/include/asm-x86/ckpt.h 2008-08-04 13:29:59.000000000 -0700 > @@ -0,0 +1,46 @@ > + > +struct cr_hdr_cpu { > + __u64 bx; > + __u64 cx; > + __u64 dx; > + __u64 si; > + __u64 di; > + __u64 bp; > + __u64 ax; > + __u64 ds; > + __u64 es; > + __u64 orig_ax; > + __u64 ip; > + __u64 cs; > + __u64 flags; > + __u64 sp; > + __u64 ss; > + __u64 fs; > + __u64 gs; > + > + __u64 debugreg0; > + __u64 debugreg1; > + __u64 debugreg2; > + __u64 debugreg3; > + __u64 debugreg6; > + __u64 debugreg7; > + > + __u8 uses_debug; > + > + __u8 used_math; > + __u8 has_fxsr; > + union thread_xstate xstate; /* i387 */ > +}; It seems weird that you use __u64 members for the registers, but don't include r8..r15 in the list. As a consequence, this structure does not seem well suited for either x86-32 or x86-64. I would suggest either using struct pt_regs by reference, or defining it so that you can use the same structure for both 32 and 64 bit x86. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <200808081409.30591.arnd@arndb.de>]
[parent not found: <200808081409.30591.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] ` <200808081409.30591.arnd-r2nGTMty4D4@public.gmane.org> @ 2008-08-08 20:28 ` Oren Laadan [not found] ` <489CAC70.7090809-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> [not found] ` <200808090029.28286.arnd@arndb.de> 0 siblings, 2 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-08 20:28 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen Hi, Thanks for the feedback. The proof-of-concept is written for x86 32 bits, keeping in mind that we'll need support for 64 bits support. My goal is to leverage feedback and contributions to have support for 64 bits and other architectures as well. Arnd Bergmann wrote: > >> diff -puN /dev/null include/asm-x86/ckpt.h >> --- /dev/null 2007-04-11 11:48:27.000000000 -0700 >> +++ linux-2.6.git-dave/include/asm-x86/ckpt.h 2008-08-04 13:29:59.000000000 -0700 >> @@ -0,0 +1,46 @@ > >> + >> +struct cr_hdr_cpu { >> + __u64 bx; >> + __u64 cx; >> + __u64 dx; >> + __u64 si; >> + __u64 di; >> + __u64 bp; >> + __u64 ax; >> + __u64 ds; >> + __u64 es; >> + __u64 orig_ax; >> + __u64 ip; >> + __u64 cs; >> + __u64 flags; >> + __u64 sp; >> + __u64 ss; >> + __u64 fs; >> + __u64 gs; >> + >> + __u64 debugreg0; >> + __u64 debugreg1; >> + __u64 debugreg2; >> + __u64 debugreg3; >> + __u64 debugreg6; >> + __u64 debugreg7; >> + >> + __u8 uses_debug; >> + >> + __u8 used_math; >> + __u8 has_fxsr; >> + union thread_xstate xstate; /* i387 */ >> +}; > > It seems weird that you use __u64 members for the registers, but don't > include r8..r15 in the list. As a consequence, this structure does not > seem well suited for either x86-32 or x86-64. In the context of CR, x86-32 and x86-64 are distinct architectures because you cannot always migrate from one to the other (though 32->64 is sometimes possible). Therefore, each architecture can have a separate checkpoint file format (eg r8..r15 only for x86-64). The information about the kernel configuration, version and cpu settings will appear on the header; so the restart code will know the architecture on which the checkpoint had been taken. So if we want to restart a task checkpointed on x86-32 on a x86-64 machine (in 32 bit mode), the code will know to not expect that data (r8..r15). Except for this special case (32 bit running 64 bit), simple conversion can be done in the kernel if needed, but most conversion between kernel the format for different kernel versions (should it change) can be done in user space (eg. with a filter). > > I would suggest either using struct pt_regs by reference, or defining > it so that you can use the same structure for both 32 and 64 bit x86. We prefer not to use the kernel structure directly, but an intermediate structure that can help mitigate subtle incompatibilities issues (between kernel configurations, versions, and even compiler versions). Anyway, either a single structure for both 32 and 64 bit x86, or separate "struct cr_hdr_cpu{_32,_64}", one for each architecture. Oren. ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <489CAC70.7090809-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] ` <489CAC70.7090809-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2008-08-08 22:29 ` Arnd Bergmann 0 siblings, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-08 22:29 UTC (permalink / raw) To: Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen On Friday 08 August 2008, Oren Laadan wrote: > > It seems weird that you use __u64 members for the registers, but don't > > include r8..r15 in the list. As a consequence, this structure does not > > seem well suited for either x86-32 or x86-64. > > In the context of CR, x86-32 and x86-64 are distinct architectures because > you cannot always migrate from one to the other (though 32->64 is sometimes > possible). Therefore, each architecture can have a separate checkpoint file > format (eg r8..r15 only for x86-64). So why do you use __u64 members for your 32 bit registers? > Except for this special case (32 bit running 64 bit), simple conversion can > be done in the kernel if needed, but most conversion between kernel the > format for different kernel versions (should it change) can be done in > user space (eg. with a filter). The 32bit on 64bit case is quite common on non-x86 architectures, e.g. powerpc or sparc, where 64 bit kernels typically run 32 bit user space. A particularly interesting case is mixing 32 and 64 bit tasks in a container that you are checkpointing. This is a very realistic scenario, so there may be good arguments for keeping the format identical between the variations of each architecture. > > I would suggest either using struct pt_regs by reference, or defining > > it so that you can use the same structure for both 32 and 64 bit x86. > > We prefer not to use the kernel structure directly, but an intermediate > structure that can help mitigate subtle incompatibilities issues (between > kernel configurations, versions, and even compiler versions). > > Anyway, either a single structure for both 32 and 64 bit x86, or separate > "struct cr_hdr_cpu{_32,_64}", one for each architecture. struct pt_regs is part of the kernel ABI, it will not change. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <200808090029.28286.arnd@arndb.de>]
[parent not found: <200808090029.28286.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] ` <200808090029.28286.arnd-r2nGTMty4D4@public.gmane.org> @ 2008-08-08 23:04 ` Oren Laadan 0 siblings, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-08 23:04 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen Arnd Bergmann wrote: > On Friday 08 August 2008, Oren Laadan wrote: > >>> It seems weird that you use __u64 members for the registers, but don't >>> include r8..r15 in the list. As a consequence, this structure does not >>> seem well suited for either x86-32 or x86-64. >> In the context of CR, x86-32 and x86-64 are distinct architectures because >> you cannot always migrate from one to the other (though 32->64 is sometimes >> possible). Therefore, each architecture can have a separate checkpoint file >> format (eg r8..r15 only for x86-64). > > So why do you use __u64 members for your 32 bit registers? The idea was that x86-32 checkpoints can be restarted on a x86-64 node in 32 bit mode. Clearly it needed more thought... > >> Except for this special case (32 bit running 64 bit), simple conversion can >> be done in the kernel if needed, but most conversion between kernel the >> format for different kernel versions (should it change) can be done in >> user space (eg. with a filter). > > The 32bit on 64bit case is quite common on non-x86 architectures, e.g. > powerpc or sparc, where 64 bit kernels typically run 32 bit user space. > > A particularly interesting case is mixing 32 and 64 bit tasks in a container > that you are checkpointing. This is a very realistic scenario, so there > may be good arguments for keeping the format identical between the variations > of each architecture. > >>> I would suggest either using struct pt_regs by reference, or defining >>> it so that you can use the same structure for both 32 and 64 bit x86. >> We prefer not to use the kernel structure directly, but an intermediate >> structure that can help mitigate subtle incompatibilities issues (between >> kernel configurations, versions, and even compiler versions). >> >> Anyway, either a single structure for both 32 and 64 bit x86, or separate >> "struct cr_hdr_cpu{_32,_64}", one for each architecture. > > struct pt_regs is part of the kernel ABI, it will not change. I'm in favor about keeping the format identical between the variations of each architecture. Note, however, that "struct pt_regs" won't do because it may change with these variations. So we'll take care of the padding and add r8..r15 in the next version. Oren. ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <489CD0F9.9060603@cs.columbia.edu>]
[parent not found: <489CD0F9.9060603-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] ` <489CD0F9.9060603-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2008-08-09 0:38 ` Dave Hansen 2008-08-09 6:43 ` Arnd Bergmann 1 sibling, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-09 0:38 UTC (permalink / raw) To: Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann On Fri, 2008-08-08 at 19:04 -0400, Oren Laadan wrote: > > struct pt_regs is part of the kernel ABI, it will not change. > > I'm in favor about keeping the format identical between the variations of > each architecture. Note, however, that "struct pt_regs" won't do because it > may change with these variations. "Part of the kernel ABI" makes it sound to me like it won't change. Who's right here? :) -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] ` <489CD0F9.9060603-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 2008-08-09 0:38 ` Dave Hansen @ 2008-08-09 6:43 ` Arnd Bergmann 1 sibling, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-09 6:43 UTC (permalink / raw) To: Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen On Saturday 09 August 2008, Oren Laadan wrote: > >> Anyway, either a single structure for both 32 and 64 bit x86, or separate > >> "struct cr_hdr_cpu{_32,_64}", one for each architecture. > > > > struct pt_regs is part of the kernel ABI, it will not change. > > I'm in favor about keeping the format identical between the variations of > each architecture. Note, however, that "struct pt_regs" won't do because it > may change with these variations. > > So we'll take care of the padding and add r8..r15 in the next version. > Fair enough. How about making the layout in that structure identical to the 64-bit pt_regs though? I don't know if we need that at any time, but my feeling is that it is nicer than a slightly different random layout, e.g. if someone wants to extend gdb to look at checkpointed process dumps. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1218242286.19082.62.camel@nimitz>]
* Re: [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] ` <1218242286.19082.62.camel@nimitz> @ 2008-08-09 1:20 ` Oren Laadan [not found] ` <489CF0CE.1000603@cs.columbia.edu> ` (2 subsequent siblings) 3 siblings, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-09 1:20 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann Dave Hansen wrote: > On Fri, 2008-08-08 at 19:04 -0400, Oren Laadan wrote: >>> struct pt_regs is part of the kernel ABI, it will not change. >> I'm in favor about keeping the format identical between the variations of >> each architecture. Note, however, that "struct pt_regs" won't do because it >> may change with these variations. > > "Part of the kernel ABI" makes it sound to me like it won't change. > Who's right here? :) > > -- Dave > hehehe .. both; I meant that while it doesn't change per architecture, it varies between architectures. So "struct pt_regs" compiled for x86-32 is different than that compiled for x86-64. Therefore we can't just dump the structure as is and expect that 64 bit would be able to parse the 32 bit. In other words, we need an intermediate representation. Oren. ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <489CF0CE.1000603@cs.columbia.edu>]
[parent not found: <489CF0CE.1000603-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] ` <489CF0CE.1000603-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2008-08-09 2:20 ` Dave Hansen 0 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-09 2:20 UTC (permalink / raw) To: Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann On Fri, 2008-08-08 at 21:20 -0400, Oren Laadan wrote: > hehehe .. both; I meant that while it doesn't change per architecture, it > varies between architectures. So "struct pt_regs" compiled for x86-32 is > different than that compiled for x86-64. Therefore we can't just dump the > structure as is and expect that 64 bit would be able to parse the 32 bit. > In other words, we need an intermediate representation. Surely we already handle this, though. Don't we allow a 32-bit app running on a 64-bit kernel to PTRACE_GETREGS and get the 32-bit version? A 64-bit app will get the 64-bit version making the same syscall. It's all handled in the syscall compatibility code. -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1218248458.19082.68.camel@nimitz>]
* Re: [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] ` <1218248458.19082.68.camel@nimitz> @ 2008-08-09 2:35 ` Oren Laadan 0 siblings, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-09 2:35 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann Dave Hansen wrote: > On Fri, 2008-08-08 at 21:20 -0400, Oren Laadan wrote: >> hehehe .. both; I meant that while it doesn't change per architecture, it >> varies between architectures. So "struct pt_regs" compiled for x86-32 is >> different than that compiled for x86-64. Therefore we can't just dump the >> structure as is and expect that 64 bit would be able to parse the 32 bit. >> In other words, we need an intermediate representation. > > Surely we already handle this, though. Don't we allow a 32-bit app > running on a 64-bit kernel to PTRACE_GETREGS and get the 32-bit version? > A 64-bit app will get the 64-bit version making the same syscall. It's > all handled in the syscall compatibility code. Sure, that's a compatibility layer around ptrace() in the 64-bit kernel. Recall that Arnd suggested "keeping the format identical between the variations of each architecture", and I fully agree. If we want to keep the format identical, we can't simply define: struct cr_hdr_cpu { struct pt_regs regs; ... }; because that will compile differently on x86-32 and x86-64. So either we add r8..r15 to the structure as it appears in the patch now (and keep the format identical), or allow the format to vary, and explicitly test for this case and add a compatibility layer. Personally I prefer the former. Oren. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] ` <1218242286.19082.62.camel@nimitz> 2008-08-09 1:20 ` Oren Laadan [not found] ` <489CF0CE.1000603@cs.columbia.edu> @ 2008-08-10 14:55 ` Jeremy Fitzhardinge [not found] ` <489F015E.9080704@goop.org> 3 siblings, 0 replies; 71+ messages in thread From: Jeremy Fitzhardinge @ 2008-08-10 14:55 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann Dave Hansen wrote: > On Fri, 2008-08-08 at 19:04 -0400, Oren Laadan wrote: > >>> struct pt_regs is part of the kernel ABI, it will not change. >>> >> I'm in favor about keeping the format identical between the variations of >> each architecture. Note, however, that "struct pt_regs" won't do because it >> may change with these variations. >> > > "Part of the kernel ABI" makes it sound to me like it won't change. > Who's right here? :) Struct pt_regs is not ABI, and can (and has) changed on x86. It's not suitable for a checkpoint structure because it only contains the registers that the kernel trashes, not all usermode registers (on i386, it leaves out %gs, for example). asm-x86/ptrace-abi.h does define stuff that's fixed in stone; it expresses it in terms of a register array, with constants defining what element is which register. J ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <489F015E.9080704@goop.org>]
[parent not found: <489F015E.9080704-TSDbQ3PG+2Y@public.gmane.org>]
* Re: [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] ` <489F015E.9080704-TSDbQ3PG+2Y@public.gmane.org> @ 2008-08-11 15:36 ` Dave Hansen 0 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-11 15:36 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann On Sun, 2008-08-10 at 07:55 -0700, Jeremy Fitzhardinge wrote: > Struct pt_regs is not ABI, and can (and has) changed on x86. It's not > suitable for a checkpoint structure because it only contains the > registers that the kernel trashes, not all usermode registers (on i386, > it leaves out %gs, for example). asm-x86/ptrace-abi.h does define stuff > that's fixed in stone; it expresses it in terms of a register array, > with constants defining what element is which register. Thanks for the explanation. I just want to reduce the coding and maintenance burden here. Xen must do this for partition mobility, right? Does it define all its own stuff? -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1218468964.5598.3.camel@nimitz>]
* Re: [RFC][PATCH 2/4] checkpoint/restart: x86 support [not found] ` <1218468964.5598.3.camel@nimitz> @ 2008-08-11 16:07 ` Jeremy Fitzhardinge 0 siblings, 0 replies; 71+ messages in thread From: Jeremy Fitzhardinge @ 2008-08-11 16:07 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann Dave Hansen wrote: > On Sun, 2008-08-10 at 07:55 -0700, Jeremy Fitzhardinge wrote: > >> Struct pt_regs is not ABI, and can (and has) changed on x86. It's not >> suitable for a checkpoint structure because it only contains the >> registers that the kernel trashes, not all usermode registers (on i386, >> it leaves out %gs, for example). asm-x86/ptrace-abi.h does define stuff >> that's fixed in stone; it expresses it in terms of a register array, >> with constants defining what element is which register. >> > > Thanks for the explanation. > > I just want to reduce the coding and maintenance burden here. Xen must > do this for partition mobility, right? Does it define all its own > stuff? > You mean save/restore/migrate? Yes, it defines all its own stuff. Checkpoint-resume on a whole VM is a rather simpler operation than a subset of processes. J ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <20080807224037.44DA0DB8@kernel>]
* Re: [RFC][PATCH 3/4] checkpoint/restart: memory management [not found] ` <20080807224037.44DA0DB8@kernel> @ 2008-08-08 12:12 ` Arnd Bergmann 0 siblings, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-08 12:12 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Friday 08 August 2008, Dave Hansen wrote: > > diff -puN ckpt/ckpt_hdr.h~memory_part ckpt/ckpt_hdr.h > --- linux-2.6.git/ckpt/ckpt_hdr.h~memory_part 2008-08-05 08:37:29.000000000 -0700 > +++ linux-2.6.git-dave/ckpt/ckpt_hdr.h 2008-08-05 08:37:29.000000000 -0700 > @@ -67,3 +67,24 @@ struct cr_hdr_task { > }; > > > + > +struct cr_hdr_mm { > + __u32 tag; /* sharing identifier */ > + __u64 start_code, end_code, start_data, end_data; > + __u64 start_brk, brk, start_stack; > + __u64 arg_start, arg_end, env_start, env_end; > + __s16 map_count; > +}; Another structure that is not 32/64 bit ABI safe on x86. It would be safe if you reorder the members as struct cr_hdr_mm { __u32 tag; /* sharing identifier */ __s16 map_count; __u16 pad; /* not actually needed, but better to make it explicit */ __u64 start_code, end_code, start_data, end_data; __u64 start_brk, brk, start_stack; __u64 arg_start, arg_end, env_start, env_end; }; > +struct cr_hdr_vma { > + __u32 how; > + > + __u64 vm_start; > + __u64 vm_end; > + __u64 vm_page_prot; > + __u64 vm_flags; > + __u64 vm_pgoff; > + > + __s16 npages; > + __s16 namelen; > +}; same here. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <20080807224038.0B03CEEF@kernel>]
* Re: [RFC][PATCH 4/4] introduce sys_checkpoint and sys_restore [not found] ` <20080807224038.0B03CEEF@kernel> @ 2008-08-08 12:15 ` Arnd Bergmann [not found] ` <200808081415.19179.arnd@arndb.de> 1 sibling, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-08 12:15 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Friday 08 August 2008, Dave Hansen wrote: > > linux-2.6.git-dave/arch/x86/kernel/syscall_table_32.S | 2 ++ > linux-2.6.git-dave/include/asm-x86/unistd_32.h | 2 ++ > 2 files changed, 4 insertions(+) System calls should also be declared in include/linux/syscalls.h. I guess you are aware that this implementation is not enough to support 32 bit tasks on x86_64. In addition to the native 64-bit code, you would also need the 32-bit compat code here. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <200808081415.19179.arnd@arndb.de>]
[parent not found: <200808081415.19179.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: [RFC][PATCH 4/4] introduce sys_checkpoint and sys_restore [not found] ` <200808081415.19179.arnd-r2nGTMty4D4@public.gmane.org> @ 2008-08-08 20:33 ` Oren Laadan 0 siblings, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-08 20:33 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen Arnd Bergmann wrote: > On Friday 08 August 2008, Dave Hansen wrote: >> linux-2.6.git-dave/arch/x86/kernel/syscall_table_32.S | 2 ++ >> linux-2.6.git-dave/include/asm-x86/unistd_32.h | 2 ++ >> 2 files changed, 4 insertions(+) > > System calls should also be declared in include/linux/syscalls.h. > > I guess you are aware that this implementation is not enough to > support 32 bit tasks on x86_64. In addition to the native 64-bit > code, you would also need the 32-bit compat code here. Yes, of course. The current code does not attempt to do that yet. Oren. ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <20080807224034.735B1F84@kernel>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <20080807224034.735B1F84@kernel> @ 2008-08-08 9:46 ` Arnd Bergmann [not found] ` <200808081146.54834.arnd@arndb.de> ` (3 subsequent siblings) 4 siblings, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-08 9:46 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Friday 08 August 2008, Dave Hansen wrote: > +/* write the checkpoint header */ > +static int cr_write_hdr(struct cr_ctx *ctx) > +{ > + struct cr_hdr h; > + struct cr_hdr_head *hh = ctx->tbuf; > + struct timeval ktv; > + > + h.type = CR_HDR_HEAD; > + h.len = sizeof(*hh); > + h.id = 0; > + > + do_gettimeofday(&ktv); > + > + hh->magic = 0x00a2d200; > + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff; > + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff; > + hh->patch = (LINUX_VERSION_CODE) & 0xff; > + > + hh->version = 1; > + > + hh->flags = ctx->flags; > + hh->time = ktv.tv_sec; > + > + return cr_write_obj(ctx, &h, hh); > +} Do you rely on the kernel version in order to determine the format of the binary data, or is it just informational? If you think the format can change in incompatible ways, you probably need something more specific than the version number these days, because there are just so many different trees with the same numbers. > + > +/* debugging */ > +#if 0 > +#define CR_PRINTK(str, args...) \ > + printk(KERN_ERR "cr@%s#%d: " str, __func__, __LINE__, ##args) > +#else > +#define CR_PRINTK(...) do {} while (0) > +#endif > + Please use the existing pr_debug and dev_debug here, instead of creating yet another version. > +struct cr_hdr_tail { > + __u32 magic; > + __u32 cksum[2]; > +}; This structure has an odd multiple of 32-bit members, which means that if you put it into a larger structure that also contains 64-bit members, the larger structure may get different alignment on x86-32 and x86-64, which you might want to avoid. I can't tell if this is an actual problem here. > + > +struct cr_hdr_task { > + __u64 state; > + __u32 exit_state; > + __u32 exit_code, exit_signal; > + > + __u16 pid; > + __u16 tgid; > + > + __u64 utime, stime, utimescaled, stimescaled; > + __u64 gtime; > + __u64 prev_utime, prev_stime; > + __u64 nvcsw, nivcsw; > + __u64 start_time_sec, start_time_nsec; > + __u64 real_start_time_sec, real_start_time_nsec; > + __u64 min_flt, maj_flt; > + > + __s16 task_comm_len; > + char comm[TASK_COMM_LEN]; > +}; In this case, I'm pretty sure that sizeof(cr_hdr_task) on x86-32 is different from x86-64, since it will be 32-bit aligned on x86-32. > + > +/* > + * During restart the code reads in data from the chekcpoint image into a > + * temporary buffer (ctx->hbuf). Because operations can be nested, one > + * should call cr_hbuf_get() to reserve space in the buffer, and then > + * cr_hbuf_put() when it no longer needs that space > + */ > + > +#include <linux/version.h> > +#include <linux/sched.h> > +#include <linux/file.h> > + > +#include "ckpt.h" > +#include "ckpt_hdr.h" > + > +/** > + * cr_hbuf_get - reserve space on the hbuf > + * @ctx: checkpoint context > + * @n: number of bytes to reserve > + */ > +void *cr_hbuf_get(struct cr_ctx *ctx, int n) > +{ > + void *ptr; > + > + BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL); > + ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos); > + ctx->hpos += n; > + return ptr; > +} Can (ctx->hpos + n > CR_HBUF_TOTAL) be controlled by the input data? If so, this is a denial-of-service attack. > + > +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count) > +{ > + mm_segment_t oldfs; > + int ret; > + > + oldfs = get_fs(); > + set_fs(KERNEL_DS); > + ret = cr_uwrite(ctx, buf, count); > + set_fs(oldfs); > + > + return ret; > +} get_fs()/set_fs() always feels a bit ouch, and this way you have to use __force to avoid the warnings about __user pointer casts in sparse. I wonder if you can use splice_read/splice_write to get around this problem. > + struct cr_ctx *ctx; > + struct file *file; > + int fput_needed; > + int ret; > + > + if (!capable(CAP_SYS_ADMIN)) > + return -EPERM; > + Why do you need CAP_SYS_ADMIN for this? Can't regular users be allowed to checkpoint/restart their own tasks? > --- linux-2.6.git/Makefile~handle_a_single_task_with_private_memory_maps 2008-08-05 09:04:27.000000000 -0700 > +++ linux-2.6.git-dave/Makefile 2008-08-05 09:07:53.000000000 -0700 > @@ -611,7 +611,7 @@ export mod_strip_cmd > > > ifeq ($(KBUILD_EXTMOD),) > -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ > +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ckpt/ > The name 'ckpt' is a bit unobvious, how about naming it 'checkpoint' instead? Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <200808081146.54834.arnd@arndb.de>]
[parent not found: <200808081146.54834.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <200808081146.54834.arnd-r2nGTMty4D4@public.gmane.org> @ 2008-08-08 18:50 ` Dave Hansen 0 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-08 18:50 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote: > On Friday 08 August 2008, Dave Hansen wrote: > > + hh->magic = 0x00a2d200; > > + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff; > > + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff; > > + hh->patch = (LINUX_VERSION_CODE) & 0xff; ... > > +} > > Do you rely on the kernel version in order to determine the format > of the binary data, or is it just informational? > > If you think the format can change in incompatible ways, you > probably need something more specific than the version number > these days, because there are just so many different trees with > the same numbers. Yeah, this is very true. My guess is that we'll need something like what we do with modversions. > > +/* debugging */ > > +#if 0 > > +#define CR_PRINTK(str, args...) \ > > + printk(KERN_ERR "cr@%s#%d: " str, __func__, __LINE__, ##args) > > +#else > > +#define CR_PRINTK(...) do {} while (0) > > +#endif > > + > > Please use the existing pr_debug and dev_debug here, instead of creating > yet another version. Sure thing. Will do. > > +struct cr_hdr_tail { > > + __u32 magic; > > + __u32 cksum[2]; > > +}; > > This structure has an odd multiple of 32-bit members, which means > that if you put it into a larger structure that also contains > 64-bit members, the larger structure may get different alignment > on x86-32 and x86-64, which you might want to avoid. > I can't tell if this is an actual problem here. Can't we just declare all these things __packed__ and stop worrying about aligning them all manually? > > + > > +/* > > + * During restart the code reads in data from the chekcpoint image into a > > + * temporary buffer (ctx->hbuf). Because operations can be nested, one > > + * should call cr_hbuf_get() to reserve space in the buffer, and then > > + * cr_hbuf_put() when it no longer needs that space > > + */ > > + > > +#include <linux/version.h> > > +#include <linux/sched.h> > > +#include <linux/file.h> > > + > > +#include "ckpt.h" > > +#include "ckpt_hdr.h" > > + > > +/** > > + * cr_hbuf_get - reserve space on the hbuf > > + * @ctx: checkpoint context > > + * @n: number of bytes to reserve > > + */ > > +void *cr_hbuf_get(struct cr_ctx *ctx, int n) > > +{ > > + void *ptr; > > + > > + BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL); > > + ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos); > > + ctx->hpos += n; > > + return ptr; > > +} > > Can (ctx->hpos + n > CR_HBUF_TOTAL) be controlled by the input > data? If so, this is a denial-of-service attack. Ugh, this is crappy code anyway. It needs to return an error and have someone else handle it. > > +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count) > > +{ > > + mm_segment_t oldfs; > > + int ret; > > + > > + oldfs = get_fs(); > > + set_fs(KERNEL_DS); > > + ret = cr_uwrite(ctx, buf, count); > > + set_fs(oldfs); > > + > > + return ret; > > +} > > get_fs()/set_fs() always feels a bit ouch, and this way you have > to use __force to avoid the warnings about __user pointer casts > in sparse. > I wonder if you can use splice_read/splice_write to get around > this problem. I have to wonder if this is just a symptom of us trying to do this the wrong way. We're trying to talk the kernel into writing internal gunk into a FD. You're right, it is like a splice where one end of the pipe is in the kernel. Any thoughts on a better way to do this? > > + struct cr_ctx *ctx; > > + struct file *file; > > + int fput_needed; > > + int ret; > > + > > + if (!capable(CAP_SYS_ADMIN)) > > + return -EPERM; > > + > > Why do you need CAP_SYS_ADMIN for this? Can't regular users > be allowed to checkpoint/restart their own tasks? Yes, eventually. I think one good point is that we should probably remove this now so that we *have* to think about security implications as we add each individual patch. For instance, what kind of checking do we do when we restore an mlock()'d VMA? I'll pull this check out so it causes pain. (the good kind) > > --- linux-2.6.git/Makefile~handle_a_single_task_with_private_memory_maps 2008-08-05 09:04:27.000000000 -0700 > > +++ linux-2.6.git-dave/Makefile 2008-08-05 09:07:53.000000000 -0700 > > @@ -611,7 +611,7 @@ export mod_strip_cmd > > > > > > ifeq ($(KBUILD_EXTMOD),) > > -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ > > +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ckpt/ > > > > The name 'ckpt' is a bit unobvious, how about naming it 'checkpoint' instead? Fine with me. Renamed in new patches, hopefully. I'll send new patches out later today. -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1218221451.19082.36.camel@nimitz>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <1218221451.19082.36.camel@nimitz> @ 2008-08-08 20:59 ` Oren Laadan 2008-08-08 22:13 ` Arnd Bergmann ` (2 subsequent siblings) 3 siblings, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-08 20:59 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann Dave Hansen wrote: > On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote: >> On Friday 08 August 2008, Dave Hansen wrote: >>> + hh->magic = 0x00a2d200; >>> + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff; >>> + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff; >>> + hh->patch = (LINUX_VERSION_CODE) & 0xff; > ... >>> +} >> Do you rely on the kernel version in order to determine the format >> of the binary data, or is it just informational? >> >> If you think the format can change in incompatible ways, you >> probably need something more specific than the version number >> these days, because there are just so many different trees with >> the same numbers. > > Yeah, this is very true. My guess is that we'll need something like > what we do with modversions. Exactly. The header should eventually contain sufficient information to describe the kernel version, configuration, compiler, cpu (arch and capabilities), and checkpoint code version. How would you suggest to identify the origin tree with an identifier (or a text field) in the header ? > >>> +/* debugging */ >>> +#if 0 >>> +#define CR_PRINTK(str, args...) \ >>> + printk(KERN_ERR "cr@%s#%d: " str, __func__, __LINE__, ##args) >>> +#else >>> +#define CR_PRINTK(...) do {} while (0) >>> +#endif >>> + >> Please use the existing pr_debug and dev_debug here, instead of creating >> yet another version. > > Sure thing. Will do. > >>> +struct cr_hdr_tail { >>> + __u32 magic; >>> + __u32 cksum[2]; >>> +}; >> This structure has an odd multiple of 32-bit members, which means >> that if you put it into a larger structure that also contains >> 64-bit members, the larger structure may get different alignment >> on x86-32 and x86-64, which you might want to avoid. >> I can't tell if this is an actual problem here. > > Can't we just declare all these things __packed__ and stop worrying > about aligning them all manually? > >>> + >>> +/* >>> + * During restart the code reads in data from the chekcpoint image into a >>> + * temporary buffer (ctx->hbuf). Because operations can be nested, one >>> + * should call cr_hbuf_get() to reserve space in the buffer, and then >>> + * cr_hbuf_put() when it no longer needs that space >>> + */ >>> + >>> +#include <linux/version.h> >>> +#include <linux/sched.h> >>> +#include <linux/file.h> >>> + >>> +#include "ckpt.h" >>> +#include "ckpt_hdr.h" >>> + >>> +/** >>> + * cr_hbuf_get - reserve space on the hbuf >>> + * @ctx: checkpoint context >>> + * @n: number of bytes to reserve >>> + */ >>> +void *cr_hbuf_get(struct cr_ctx *ctx, int n) >>> +{ >>> + void *ptr; >>> + >>> + BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL); >>> + ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos); >>> + ctx->hpos += n; >>> + return ptr; >>> +} >> Can (ctx->hpos + n > CR_HBUF_TOTAL) be controlled by the input >> data? If so, this is a denial-of-service attack. > > Ugh, this is crappy code anyway. It needs to return an error and have > someone else handle it. Actually not quite. 'n' is _not_ controlled by the input data, and at the same time ctx->hpos should always carry enough room by design. If that is not the case, then it's a logical bug, not DoS attack. To avoid repetitive malloc/free, ctx->hbuf is a buffer to host headers as they are read; since headers can be read in a nested manner, ctx->hpos points to the next free position in that buffer. So 'n' is the size of the header that we are about to read - decided at compile time, not the user input. The BUG_ON() statement asserts that by design we have enough buffer (like you'd check that you didn't run out of kernel stack...) If it is preferred, we can change this to write a kernel message and return a special error telling that a logical error has occurred. > >>> +int cr_kwrite(struct cr_ctx *ctx, void *buf, int count) >>> +{ >>> + mm_segment_t oldfs; >>> + int ret; >>> + >>> + oldfs = get_fs(); >>> + set_fs(KERNEL_DS); >>> + ret = cr_uwrite(ctx, buf, count); >>> + set_fs(oldfs); >>> + >>> + return ret; >>> +} >> get_fs()/set_fs() always feels a bit ouch, and this way you have >> to use __force to avoid the warnings about __user pointer casts >> in sparse. >> I wonder if you can use splice_read/splice_write to get around >> this problem. > > I have to wonder if this is just a symptom of us trying to do this the > wrong way. We're trying to talk the kernel into writing internal gunk > into a FD. You're right, it is like a splice where one end of the pipe > is in the kernel. > > Any thoughts on a better way to do this? > >>> + struct cr_ctx *ctx; >>> + struct file *file; >>> + int fput_needed; >>> + int ret; >>> + >>> + if (!capable(CAP_SYS_ADMIN)) >>> + return -EPERM; >>> + >> Why do you need CAP_SYS_ADMIN for this? Can't regular users >> be allowed to checkpoint/restart their own tasks? > > Yes, eventually. I think one good point is that we should probably > remove this now so that we *have* to think about security implications > as we add each individual patch. For instance, what kind of checking do > we do when we restore an mlock()'d VMA? > > I'll pull this check out so it causes pain. (the good kind) Hmmm... even if not strictly now, we *will* need admin privileges for the CR operations, for the following reasons: checkpoint: we save the entire state of a set of processes to a file - so we must have privileges to do so, at least within (or with respect to) the said container. Even if we are the user who owns the container, we'll need root access within that container. restart: we restore the entire set of a set of processes, which may require some privileged operations (again, at least within or with respect to the said container). Otherwise any user could inject any restart data into the kernel and create any set of processes with arbitrary permissions. In a sense, we need something similar to granting ptrace access. > >>> --- linux-2.6.git/Makefile~handle_a_single_task_with_private_memory_maps 2008-08-05 09:04:27.000000000 -0700 >>> +++ linux-2.6.git-dave/Makefile 2008-08-05 09:07:53.000000000 -0700 >>> @@ -611,7 +611,7 @@ export mod_strip_cmd >>> >>> >>> ifeq ($(KBUILD_EXTMOD),) >>> -core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ >>> +core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ckpt/ >>> >> The name 'ckpt' is a bit unobvious, how about naming it 'checkpoint' instead? > > Fine with me. Renamed in new patches, hopefully. > > I'll send new patches out later today. > > -- Dave > ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <1218221451.19082.36.camel@nimitz> 2008-08-08 20:59 ` Oren Laadan @ 2008-08-08 22:13 ` Arnd Bergmann [not found] ` <200808090013.41999.arnd@arndb.de> [not found] ` <489CB3CA.6050304@cs.columbia.edu> 3 siblings, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-08 22:13 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Friday 08 August 2008, Dave Hansen wrote: > On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote: > > > +struct cr_hdr_tail { > > > + __u32 magic; > > > + __u32 cksum[2]; > > > +}; > > > > This structure has an odd multiple of 32-bit members, which means > > that if you put it into a larger structure that also contains > > 64-bit members, the larger structure may get different alignment > > on x86-32 and x86-64, which you might want to avoid. > > I can't tell if this is an actual problem here. > > Can't we just declare all these things __packed__ and stop worrying > about aligning them all manually? I personally dislike __packed__ because it makes it very easy to get suboptimal object code. If you either pad every structure to a multiple of 64 bits or avoid __u64 members, you don't have a problem. Also, I think avoiding implicit padding inside of data structures is very helpful for user interfaces, if necessary you can always add explicit padding. > > get_fs()/set_fs() always feels a bit ouch, and this way you have > > to use __force to avoid the warnings about __user pointer casts > > in sparse. > > I wonder if you can use splice_read/splice_write to get around > > this problem. > > I have to wonder if this is just a symptom of us trying to do this the > wrong way. We're trying to talk the kernel into writing internal gunk > into a FD. You're right, it is like a splice where one end of the pipe > is in the kernel. > > Any thoughts on a better way to do this? Maybe you can invert the logic and let the new syscalls create a file descriptor, and then have user space read or splice the checkpoint data from it, and restore it by writing to the file descriptor. It's probably easy to do using anon_inode_getfd() and would solve this problem, but at the same time make checkpointing the current thread hard if not impossible. > Yes, eventually. I think one good point is that we should probably > remove this now so that we *have* to think about security implications > as we add each individual patch. For instance, what kind of checking do > we do when we restore an mlock()'d VMA? I think the question can be generalized further: How do you deal with saved tasks that have more priviledges than the task doing the restore? There are probably more, but what I can think of right now includes: * anything you can set using ulimit * capabilities * threads running as another user/group * open files that have had their permissions changed after the open Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <200808090013.41999.arnd@arndb.de>]
[parent not found: <200808090013.41999.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <200808090013.41999.arnd-r2nGTMty4D4@public.gmane.org> @ 2008-08-08 22:26 ` Dave Hansen 2008-08-11 15:22 ` Serge E. Hallyn 1 sibling, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-08 22:26 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Sat, 2008-08-09 at 00:13 +0200, Arnd Bergmann wrote: > > I have to wonder if this is just a symptom of us trying to do this the > > wrong way. We're trying to talk the kernel into writing internal gunk > > into a FD. You're right, it is like a splice where one end of the pipe > > is in the kernel. > > > > Any thoughts on a better way to do this? > > Maybe you can invert the logic and let the new syscalls create a file > descriptor, and then have user space read or splice the checkpoint > data from it, and restore it by writing to the file descriptor. > It's probably easy to do using anon_inode_getfd() and would solve this > problem, but at the same time make checkpointing the current thread > hard if not impossible. Yeah, it does seem kinda backwards. But, instead of even having to worry about the anon_inode stuff, why don't we just put it in a fs like everything else? checkpointfs! I'm also really not convinced that putting the entire checkpoint in one glob is really the solution, either. I mean, is system call overhead really a problem here? -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <200808090013.41999.arnd-r2nGTMty4D4@public.gmane.org> 2008-08-08 22:26 ` Dave Hansen @ 2008-08-11 15:22 ` Serge E. Hallyn 1 sibling, 0 replies; 71+ messages in thread From: Serge E. Hallyn @ 2008-08-11 15:22 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen Quoting Arnd Bergmann (arnd-r2nGTMty4D4@public.gmane.org): > On Friday 08 August 2008, Dave Hansen wrote: > > On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote: > > > > > +struct cr_hdr_tail { > > > > + __u32 magic; > > > > + __u32 cksum[2]; > > > > +}; > > > > > > This structure has an odd multiple of 32-bit members, which means > > > that if you put it into a larger structure that also contains > > > 64-bit members, the larger structure may get different alignment > > > on x86-32 and x86-64, which you might want to avoid. > > > I can't tell if this is an actual problem here. > > > > Can't we just declare all these things __packed__ and stop worrying > > about aligning them all manually? > > I personally dislike __packed__ because it makes it very easy to get > suboptimal object code. If you either pad every structure to a multiple > of 64 bits or avoid __u64 members, you don't have a problem. Also, > I think avoiding implicit padding inside of data structures is very > helpful for user interfaces, if necessary you can always add explicit > padding. > > > > get_fs()/set_fs() always feels a bit ouch, and this way you have > > > to use __force to avoid the warnings about __user pointer casts > > > in sparse. > > > I wonder if you can use splice_read/splice_write to get around > > > this problem. > > > > I have to wonder if this is just a symptom of us trying to do this the > > wrong way. We're trying to talk the kernel into writing internal gunk > > into a FD. You're right, it is like a splice where one end of the pipe > > is in the kernel. > > > > Any thoughts on a better way to do this? > > Maybe you can invert the logic and let the new syscalls create a file > descriptor, and then have user space read or splice the checkpoint > data from it, and restore it by writing to the file descriptor. > It's probably easy to do using anon_inode_getfd() and would solve this > problem, but at the same time make checkpointing the current thread > hard if not impossible. > > > Yes, eventually. I think one good point is that we should probably > > remove this now so that we *have* to think about security implications > > as we add each individual patch. For instance, what kind of checking do > > we do when we restore an mlock()'d VMA? > > I think the question can be generalized further: How do you deal with > saved tasks that have more priviledges than the task doing the restore? > > There are probably more, but what I can think of right now includes: > * anything you can set using ulimit > * capabilities > * threads running as another user/group > * open files that have had their permissions changed after the open At the checkpoint end, the ptrace checks seem apporpriate: If you're allowed to stop and manipulate the process, then you may as well be allowed to checkpoint and see/tweak its memory that way. At the restart end, every resource which was checkpointed will have to be re-created, and permissions checked against the privilege of the task which did the restart. We may end up having to make use of the new credentials for this. This could become unpleasant: if an unprivileged task asked a privileged helper to create something for the unprivileged task to use (i.e. a raw socket), then the user needs to be privileged to re-created the resource. But it's necessary. -serge ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1218234411.19082.58.camel@nimitz>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <1218234411.19082.58.camel@nimitz> @ 2008-08-08 22:39 ` Arnd Bergmann [not found] ` <200808090039.20289.arnd@arndb.de> ` (2 subsequent siblings) 3 siblings, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-08 22:39 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Saturday 09 August 2008, Dave Hansen wrote: > On Sat, 2008-08-09 at 00:13 +0200, Arnd Bergmann wrote: > > Maybe you can invert the logic and let the new syscalls create a file > > descriptor, and then have user space read or splice the checkpoint > > data from it, and restore it by writing to the file descriptor. > > It's probably easy to do using anon_inode_getfd() and would solve this > > problem, but at the same time make checkpointing the current thread > > hard if not impossible. > > Yeah, it does seem kinda backwards. But, instead of even having to > worry about the anon_inode stuff, why don't we just put it in a fs like > everything else? checkpointfs! Well, anon_inodes are really easy to use and have replaced some of the simple non-mountable file systems in the kernel. checkpointfs sounds interesting and I guess in a plan9 world of fairies and fantasy, you should be able to create a checkpoint of your system using 'tar czf - /proc/', but I'm not sure it helps here. The main problem I see with that would be atomicity: If you want multiple processes to keep interacting with each other, you need to save them at the same point in time, which gets harder as you split your interface into more than a single file descriptor. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <200808090039.20289.arnd@arndb.de>]
[parent not found: <200808090039.20289.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <200808090039.20289.arnd-r2nGTMty4D4@public.gmane.org> @ 2008-08-09 0:43 ` Dave Hansen 0 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-09 0:43 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Sat, 2008-08-09 at 00:39 +0200, Arnd Bergmann wrote: > The main problem I see with that would be atomicity: If you want multiple > processes to keep interacting with each other, you need to save them at > the same point in time, which gets harder as you split your interface into > more than a single file descriptor. It could take ages to write out a checkpoint even to a single fd, so I suspect we'd have the exact same kinds of issues either way. -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1218242614.19082.65.camel@nimitz>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <1218242614.19082.65.camel@nimitz> @ 2008-08-09 6:37 ` Arnd Bergmann [not found] ` <200808090837.07417.arnd@arndb.de> 1 sibling, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-09 6:37 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Saturday 09 August 2008, Dave Hansen wrote: > On Sat, 2008-08-09 at 00:39 +0200, Arnd Bergmann wrote: > > The main problem I see with that would be atomicity: If you want multiple > > processes to keep interacting with each other, you need to save them at > > the same point in time, which gets harder as you split your interface into > > more than a single file descriptor. > > It could take ages to write out a checkpoint even to a single fd, so I > suspect we'd have the exact same kinds of issues either way. I guess either way, you have to SIGSTOP (or similar) all the tasks you want to checkpoint atomically before you start saving the contents. If you use a single fd, you can do that under the covers, when using a more complex file system, it seems more logical to require an explicit interface for this. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <200808090837.07417.arnd@arndb.de>]
[parent not found: <200808090837.07417.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <200808090837.07417.arnd-r2nGTMty4D4@public.gmane.org> @ 2008-08-09 13:39 ` Dave Hansen 0 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-09 13:39 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Sat, 2008-08-09 at 08:37 +0200, Arnd Bergmann wrote: > On Saturday 09 August 2008, Dave Hansen wrote: > > On Sat, 2008-08-09 at 00:39 +0200, Arnd Bergmann wrote: > > > The main problem I see with that would be atomicity: If you want multiple > > > processes to keep interacting with each other, you need to save them at > > > the same point in time, which gets harder as you split your interface into > > > more than a single file descriptor. > > > > It could take ages to write out a checkpoint even to a single fd, so I > > suspect we'd have the exact same kinds of issues either way. > > I guess either way, you have to SIGSTOP (or similar) all the tasks you want > to checkpoint atomically before you start saving the contents. > If you use a single fd, you can do that under the covers, when using a > more complex file system, it seems more logical to require an explicit > interface for this. Oh, we're already working on patches to the freezer code to do this for us. There's a branch in here from Matt H. that's doing just that: http://git.kernel.org/?p=linux/kernel/git/daveh/linux-2.6-next-lxc.git;a=shortlog -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <1218234411.19082.58.camel@nimitz> 2008-08-08 22:39 ` Arnd Bergmann [not found] ` <200808090039.20289.arnd@arndb.de> @ 2008-08-11 15:07 ` Serge E. Hallyn [not found] ` <20080811150703.GA25930@us.ibm.com> 3 siblings, 0 replies; 71+ messages in thread From: Serge E. Hallyn @ 2008-08-11 15:07 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann Quoting Dave Hansen (dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org): > On Sat, 2008-08-09 at 00:13 +0200, Arnd Bergmann wrote: > > > I have to wonder if this is just a symptom of us trying to do this the > > > wrong way. We're trying to talk the kernel into writing internal gunk > > > into a FD. You're right, it is like a splice where one end of the pipe > > > is in the kernel. > > > > > > Any thoughts on a better way to do this? > > > > Maybe you can invert the logic and let the new syscalls create a file > > descriptor, and then have user space read or splice the checkpoint > > data from it, and restore it by writing to the file descriptor. > > It's probably easy to do using anon_inode_getfd() and would solve this > > problem, but at the same time make checkpointing the current thread > > hard if not impossible. > > Yeah, it does seem kinda backwards. But, instead of even having to > worry about the anon_inode stuff, why don't we just put it in a fs like > everything else? checkpointfs! One reason is that I suspect that stops us from being able to send that data straight to a pipe to compress and/or send on the network, without hitting local disk. Though if the checkpointfs was ram-based maybe not? As Oren has pointed out before, passing in an fd means we can pass a socket into the syscall. Using the anon_inodes would also prevent that, but if it makes for a cleaner overall solution then I'm not against considering either one of course. > I'm also really not convinced that putting the entire checkpoint in one > glob is really the solution, either. I mean, is system call overhead > really a problem here? > > -- Dave > > _______________________________________________ > Containers mailing list > Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org > https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <20080811150703.GA25930@us.ibm.com>]
[parent not found: <20080811150703.GA25930-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <20080811150703.GA25930-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2008-08-11 15:25 ` Arnd Bergmann 2008-08-14 5:53 ` Pavel Machek 1 sibling, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-11 15:25 UTC (permalink / raw) To: Serge E. Hallyn Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen On Monday 11 August 2008, Serge E. Hallyn wrote: > One reason is that I suspect that stops us from being able to send that > data straight to a pipe to compress and/or send on the network, without > hitting local disk. Though if the checkpointfs was ram-based maybe not? > > As Oren has pointed out before, passing in an fd means we can pass a > socket into the syscall. > > Using the anon_inodes would also prevent that, but if it makes for a > cleaner overall solution then I'm not against considering either one > of course. > With anon_inodes, you can still implement splice_read/splice_write, so you can splice it into a socket. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <20080811150703.GA25930-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> 2008-08-11 15:25 ` Arnd Bergmann @ 2008-08-14 5:53 ` Pavel Machek 1 sibling, 0 replies; 71+ messages in thread From: Pavel Machek @ 2008-08-14 5:53 UTC (permalink / raw) To: Serge E. Hallyn Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann, Dave Hansen Hi! > > > > I have to wonder if this is just a symptom of us trying to do this the > > > > wrong way. We're trying to talk the kernel into writing internal gunk > > > > into a FD. You're right, it is like a splice where one end of the pipe > > > > is in the kernel. > > > > > > > > Any thoughts on a better way to do this? > > > > > > Maybe you can invert the logic and let the new syscalls create a file > > > descriptor, and then have user space read or splice the checkpoint > > > data from it, and restore it by writing to the file descriptor. > > > It's probably easy to do using anon_inode_getfd() and would solve this > > > problem, but at the same time make checkpointing the current thread > > > hard if not impossible. > > > > Yeah, it does seem kinda backwards. But, instead of even having to > > worry about the anon_inode stuff, why don't we just put it in a fs like > > everything else? checkpointfs! > > One reason is that I suspect that stops us from being able to send that > data straight to a pipe to compress and/or send on the network, without > hitting local disk. Though if the checkpointfs was ram-based maybe not? > > As Oren has pointed out before, passing in an fd means we can pass a > socket into the syscall. If you do pass a socket, will it handle blocking correctly? Getting deadlocked task would be bad. What happens if I try to snapshot into /proc/self/fd/0 ? Or maybe restore from /proc/cmdline? -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <20080814055301.GH6995@ucw.cz>]
[parent not found: <20080814055301.GH6995-+ZI9xUNit7I@public.gmane.org>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <20080814055301.GH6995-+ZI9xUNit7I@public.gmane.org> @ 2008-08-14 15:12 ` Dave Hansen 2008-08-20 21:40 ` Oren Laadan 1 sibling, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-14 15:12 UTC (permalink / raw) To: Pavel Machek Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann On Thu, 2008-08-14 at 07:53 +0200, Pavel Machek wrote: > > As Oren has pointed out before, passing in an fd means we can pass a > > socket into the syscall. > > If you do pass a socket, will it handle blocking correctly? Getting > deadlocked task would be bad. What happens if I try to snapshot into > /proc/self/fd/0 ? Or maybe restore from /proc/cmdline? Heh, that's a good point. What was the other code where we kept coming up with deadlocks like that? Anyone remember? -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <20080814055301.GH6995-+ZI9xUNit7I@public.gmane.org> 2008-08-14 15:12 ` Dave Hansen @ 2008-08-20 21:40 ` Oren Laadan 1 sibling, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-20 21:40 UTC (permalink / raw) To: Pavel Machek Cc: Theodore Tso, Arnd Bergmann, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen Pavel Machek wrote: > Hi! > >>>>> I have to wonder if this is just a symptom of us trying to do this the >>>>> wrong way. We're trying to talk the kernel into writing internal gunk >>>>> into a FD. You're right, it is like a splice where one end of the pipe >>>>> is in the kernel. >>>>> >>>>> Any thoughts on a better way to do this? >>>> Maybe you can invert the logic and let the new syscalls create a file >>>> descriptor, and then have user space read or splice the checkpoint >>>> data from it, and restore it by writing to the file descriptor. >>>> It's probably easy to do using anon_inode_getfd() and would solve this >>>> problem, but at the same time make checkpointing the current thread >>>> hard if not impossible. >>> Yeah, it does seem kinda backwards. But, instead of even having to >>> worry about the anon_inode stuff, why don't we just put it in a fs like >>> everything else? checkpointfs! >> One reason is that I suspect that stops us from being able to send that >> data straight to a pipe to compress and/or send on the network, without >> hitting local disk. Though if the checkpointfs was ram-based maybe not? >> >> As Oren has pointed out before, passing in an fd means we can pass a >> socket into the syscall. > > If you do pass a socket, will it handle blocking correctly? Getting > deadlocked task would be bad. What happens if I try to snapshot into > /proc/self/fd/0 ? Or maybe restore from /proc/cmdline? Hmmm... these are good points. Keep in mind that our principal goal is to checkpoint a whole container, rather then a task to checkpoint itself (which is a by-product). Of course your comments apply to a whole container as well. In both cases, I don't think that blocking on a socket is a problem; the checkpointer will enter a TASK_INTERRUPTIBLE state. Where is the deadlock ? Writing or reading to/from /proc/self/... likewise - the programmer must understand the implications, or the program won't work as expected. I don't see a possible deadlock here, though. For example - writing to /proc/self/fd/0 is ok; the state of fd[0] of that task will be captured at some point in the middle of the checkpoint, so after restart one cannot assume anything about the file position; the rest should work. Oren. ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <20080811152201.GB25930@us.ibm.com>]
[parent not found: <20080811152201.GB25930-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <20080811152201.GB25930-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2008-08-11 16:53 ` Arnd Bergmann 0 siblings, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-11 16:53 UTC (permalink / raw) To: Serge E. Hallyn Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen On Monday 11 August 2008, Serge E. Hallyn wrote: > At the restart end, every resource which was checkpointed will have to > be re-created, and permissions checked against the privilege of the > task which did the restart. We may end up having to make use of the new > credentials for this. > > This could become unpleasant: if an unprivileged task asked a privileged > helper to create something for the unprivileged task to use (i.e. a > raw socket), then the user needs to be privileged to re-created the > resource. But it's necessary. Right. Of course, the hard part here will be to make it obvious to be safe. Having to check all sorts of permissions means there will be many opportunities for exploitable bugs. The best way I can think of for this would be to use existing syscalls (e.g. sched_setscheduler, setfsuid, ...) from user space whereever possible and do only the bare minimum for the restart part in the kernel. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <200808111853.13854.arnd@arndb.de>]
[parent not found: <200808111853.13854.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <200808111853.13854.arnd-r2nGTMty4D4@public.gmane.org> @ 2008-08-11 17:11 ` Dave Hansen 2008-08-11 19:48 ` checkpoint/restart ABI Dave Hansen 1 sibling, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-11 17:11 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Mon, 2008-08-11 at 18:53 +0200, Arnd Bergmann wrote: > The best way I can think of for this would be to use existing syscalls > (e.g. sched_setscheduler, setfsuid, ...) from user space whereever > possible and do only the bare minimum for the restart part in the > kernel. Well, the current direction is about as far away from that as you can get, unless we basically call those system calls from inside our new sys_restart() one. As of now, we're as much work in the kernel as possible, and doing the bare minimum in userspace. That's what both Oren and our OpenVZ colleagues have advocated. -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
* checkpoint/restart ABI [not found] ` <200808111853.13854.arnd-r2nGTMty4D4@public.gmane.org> 2008-08-11 17:11 ` Dave Hansen @ 2008-08-11 19:48 ` Dave Hansen 1 sibling, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-11 19:48 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA Arnd, Jeremy and Oren, Thanks for all of the very interesting comments about the ABI. Considering that we're still *really* early in getting this concept merged up into mainline, what do you all think we should do now? My main goal here is just to get everyone to understand the approach that we're proposing rather than to really fix the interfaces in stone. I bet we're going to be changing them a lot before these patches actually get in. -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1218484114.5598.43.camel@nimitz>]
* Re: checkpoint/restart ABI [not found] ` <1218484114.5598.43.camel@nimitz> @ 2008-08-11 21:47 ` Arnd Bergmann 2008-08-11 21:54 ` Oren Laadan ` (3 subsequent siblings) 4 siblings, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-11 21:47 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Monday 11 August 2008, Dave Hansen wrote: > Thanks for all of the very interesting comments about the ABI. > > Considering that we're still *really* early in getting this concept > merged up into mainline, what do you all think we should do now? I think the two most important aspects here need to be security and simplicity. If you have to choose between the two, it probably makes sense to put security first, because loading untrusted data into the kernel puts you at a significant risk to start with. If you can show a restart interface that lets regular users restart their tasks in a way anyone can verify to be secure, that will be a good indication that you're on the right track. The other problem that you really need to solve is interface stability. What you are creating is a binary representation of many kernel internal data structures, so in our common rules, you have to make sure that you remain forward and backward compatible. Simply saying that you need to run an identical kernel when restarting from a checkpoint is not enough IMHO. Some more words on specific interfaces that we have discussed: The single-file-descriptor approach has the big advantage of keeping the complexity in one place (the kernel). To be consistent with other kernel interfaces, I would make the kernel hand out a file descriptor, not let the user open a file and pass that into the kernel as you do now. A new file system is a good idea for many complex interfaces that make their way into the kernel, but I don't think it will help in this case. For checkpointing a single task, or even a task with its children, a different interface I could imagine would be to have a new file in procfs per pid that you can read as a pipe giving our the same data that you currently save in the checkpoint file descriptor. It does mean that you won't be able to pass flags down easily (you could write to the pipe before you start reading, but that's not too nice). On the restart side, I think the most consistent interface would be a new binfmt_chkpt implementation that you can use to execve a checkpoint, just like you execute an ELF file today. The binfmt can be a module (unlike a syscall), so an administrator that is afraid of the security implications can just disable it by not loading the module. In an execve model, the parent process can set up anything related to credentials as good as it's allowed to and then let the kernel do the rest. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: checkpoint/restart ABI [not found] ` <1218484114.5598.43.camel@nimitz> 2008-08-11 21:47 ` Arnd Bergmann @ 2008-08-11 21:54 ` Oren Laadan 2008-08-11 23:38 ` Jeremy Fitzhardinge ` (2 subsequent siblings) 4 siblings, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-11 21:54 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann Quoting Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>: > Arnd, Jeremy and Oren, > > Thanks for all of the very interesting comments about the ABI. I second that ! > > Considering that we're still *really* early in getting this concept > merged up into mainline, what do you all think we should do now? > > My main goal here is just to get everyone to understand the approach > that we're proposing rather than to really fix the interfaces in stone. > I bet we're going to be changing them a lot before these patches > actually get in. I closely follow the valuable feedback and fix the code accordingly. I propose to extend the proof of concept, to also be able to save and restore "simple" open files (regular files, directories). My motivation is twofold: (1) Providing eough functionality for people to meaningfully play with the proposed patches and try them. (2) Demonstrate how I propose to handle shared resources (open files). The first point is very important because it makes the concept actually useful to a much broader set of programs than otherwise. I hope this will attract a larger audience :) To this end, I have already extended the patchset, and I should be able to send something working in a day or two. Oren. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: checkpoint/restart ABI [not found] ` <1218484114.5598.43.camel@nimitz> 2008-08-11 21:47 ` Arnd Bergmann 2008-08-11 21:54 ` Oren Laadan @ 2008-08-11 23:38 ` Jeremy Fitzhardinge [not found] ` <48A0CD86.6030704@goop.org> [not found] ` <200808112347.50245.arnd@arndb.de> 4 siblings, 0 replies; 71+ messages in thread From: Jeremy Fitzhardinge @ 2008-08-11 23:38 UTC (permalink / raw) To: Dave Hansen Cc: Theodore Tso, Arnd Bergmann, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Peter Chubb Dave Hansen wrote: > Arnd, Jeremy and Oren, > > Thanks for all of the very interesting comments about the ABI. > Considering that we're still *really* early in getting this concept > merged up into mainline, what do you all think we should do now? > > My main goal here is just to get everyone to understand the approach > that we're proposing rather than to really fix the interfaces in stone. > I bet we're going to be changing them a lot before these patches > actually get in. > Yes. It seems to me that worrying about ABI at this point is a bit premature. This feature, as it currently stands, is essentially useless for any practical purpose. Self-checkpointing a single process with no handling of non-file file descriptors and no proper handling of file file-descriptors is not very useful. My understanding that this is basically a prototype for a more useful multi-process or container-wide checkpoint facility. While you could try to come up with an extensible file format that would be able to handle any future extensions, the chances are you'd get it wrong and need to break file format compatibility anyway. I'm more interested in seeing a description of how you're doing to handle things like: * multiple processes * pipes * UNIX domain sockets * INET sockets (both inter and intra machine) * unlinked open files * checkpointing file content * closed files (ie, files which aren't currently open, but will be soon, esp tmp files) * shared memory * (Peter, what have I forgotten?) Having gone through this before, I don't think an all-kernel solution can work except for the most simple cases. Which, come to think of it, is an important point. What are the expected use-cases for this feature? Do you really mean checkpoint/restart? Do you expect to be able to checkpoint a process, leave it running, then "rewind" by restoring the image? Or does checkpoint always atomically kill the source process(es)? Are you expecting to be able to resume on another machine? Lightweight filesystem checkpointing, such as btrfs provides, would seem like a powerful mechanism for handling a lot of the filesystem state problems. It would have been useful when we did this... J ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <48A0CD86.6030704@goop.org>]
[parent not found: <48A0CD86.6030704-TSDbQ3PG+2Y@public.gmane.org>]
* Re: checkpoint/restart ABI [not found] ` <48A0CD86.6030704-TSDbQ3PG+2Y@public.gmane.org> @ 2008-08-11 23:54 ` Peter Chubb 2008-08-12 14:58 ` Dave Hansen 1 sibling, 0 replies; 71+ messages in thread From: Peter Chubb @ 2008-08-11 23:54 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Theodore Tso, Arnd Bergmann, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen, Peter Chubb >>>>> "Jeremy" == Jeremy Fitzhardinge <jeremy-TSDbQ3PG+2Y@public.gmane.org> writes: Jeremy> Dave Hansen wrote: >> Arnd, Jeremy and Oren, >> Jeremy> * multiple processes * pipes * UNIX domain sockets * INET Jeremy> sockets (both inter and intra machine) * unlinked open files * Jeremy> checkpointing file content * closed files (ie, files which Jeremy> aren't currently open, but will be soon, esp tmp files) * Jeremy> shared memory * (Peter, what have I forgotten?) File sharing; multiple threads with wierd sharing arrangements (think: clone with various parameters, followed by exec in some of the threads but not others); MERT/system-V shared memory, semaphores and message queues; devices (audio, framebuffer, etc), HugeTLBFS, numa issues (pinning, memory layout), processes being debugged (so, checkpoint.restart a gdb/target pair), futexes, etc., etc. Linux process state keeps expanding. Jeremy> Having gone through this before, I don't think an all-kernel Jeremy> solution can work except for the most simple cases. I agree ... it's better to put mechanisms into the kernel that can then be used by a user-space programme to actually do the checkpointing and restarting. Beefing up ptrace or fixing /proc to be a real debugging interface would be a start ... when you can get at *all* the info you need, quickly and easily, the userspace checkpoint falls out fairly naturally. You still have to work out an extensible file format to store stuff, and how to restore all that state you've so lovingly collected. Jeremy> Lightweight filesystem checkpointing, such as btrfs provides, Jeremy> would seem like a powerful mechanism for handling a lot of the Jeremy> filesystem state problems. It would have been useful when we Jeremy> did this... And how! saving bits of files was very timeconsuming. -- Dr Peter Chubb http://www.gelato.unsw.edu.au peterc AT gelato.unsw.edu.au http://www.ertos.nicta.com.au ERTOS within National ICT Australia ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: checkpoint/restart ABI [not found] ` <48A0CD86.6030704-TSDbQ3PG+2Y@public.gmane.org> 2008-08-11 23:54 ` Peter Chubb @ 2008-08-12 14:58 ` Dave Hansen 1 sibling, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-12 14:58 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Theodore Tso, Daniel Lezcano, Arnd Bergmann, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Peter Chubb On Mon, 2008-08-11 at 16:38 -0700, Jeremy Fitzhardinge wrote: > This feature, as it currently stands, is essentially useless for any > practical purpose. Self-checkpointing a single process with no handling > of non-file file descriptors and no proper handling of file > file-descriptors is not very useful. > > My understanding that this is basically a prototype for a more useful > multi-process or container-wide checkpoint facility. Yes, that's exactly it. We're diverging from discussing the important bits as it is, and I think we'd do that more and more with extra code. :) > While you could try to come up with an extensible file format that would > be able to handle any future extensions, the chances are you'd get it > wrong and need to break file format compatibility anyway. Amen to that. I won't speak for the rest of the whackos interested in this stuff, but I *KNOW* I'm not clever enough to pull it off. > I'm more interested in seeing a description of how you're doing to > handle things like: > > * multiple processes > * pipes > * UNIX domain sockets > * INET sockets (both inter and intra machine) > * unlinked open files > * checkpointing file content > * closed files (ie, files which aren't currently open, but will be > soon, esp tmp files) > * shared memory > * (Peter, what have I forgotten?) > > Having gone through this before, I don't think an all-kernel solution > can work except for the most simple cases. So, there's a lot of stuff there. The networking stuff is way out of my league, so I'll cc Daniel and make him answer. :) All of the other stuff has been done in various in-kernel implementations. OpenVZ, IBM's Metacluster, Zap (Oren's work at Columbia). Most of it *can* be done from userspace, but some of it is very painful. There are some good OLS papers describing most of these things. Zap might have had one or two academic papers written about it. Maybe. ;) Unlinked files, for instance, are actually available in /proc. You can freeze the app, write a helper that opens /proc/1234/fd, then copies its contents to a linked file (ooooh, with splice!) Anyway, if we can do it in userspace, we can surely do it in the kernel. I'm not sure what you mean by "closed files". Either the app has a fd, it doesn't, or it is in sys_open() somewhere. We have to get the app into a quiescent state before we can checkpoint, so we basically just say that we won't checkpoint things that are *in* the kernel. Is there anything specific you are thinking of that particularly worries you? I could write pages on the list you have there. > Which, come to think of it, is an important point. What are the > expected use-cases for this feature? Do you really mean > checkpoint/restart? Do you expect to be able to checkpoint a process, > leave it running, then "rewind" by restoring the image? Or does > checkpoint always atomically kill the source process(es)? Are you > expecting to be able to resume on another machine? Yes. We all want different things, and there are a lot of people interested in this stuff. So, I think all of what you've mentioned above are goals, at least long term. Some, *really* long term. I don't want to get into a full virtualization vs. containers debate, but we also want it for all the same reasons that you migrate Xen partitions. > Lightweight filesystem checkpointing, such as btrfs provides, would seem > like a powerful mechanism for handling a lot of the filesystem state > problems. It would have been useful when we did this... Yup. We were just chatting about that with some filesystem folks last week. But, as the OpenVZ dudes like to mention, the poor man's way of moving filesystem snapshots around is always rsync. -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <87d4kfds5i.wl%peterc@chubb.wattle.id.au>]
[parent not found: <87d4kfds5i.wl%peterc-LkDQP0DxSMGxwJ88Py/mJxCuuivNXqWP@public.gmane.org>]
* Re: checkpoint/restart ABI [not found] ` <87d4kfds5i.wl%peterc-LkDQP0DxSMGxwJ88Py/mJxCuuivNXqWP@public.gmane.org> @ 2008-08-12 14:49 ` Serge E. Hallyn 2008-08-12 15:11 ` Dave Hansen 1 sibling, 0 replies; 71+ messages in thread From: Serge E. Hallyn @ 2008-08-12 14:49 UTC (permalink / raw) To: Peter Chubb Cc: Jeremy Fitzhardinge, Theodore Tso, Arnd Bergmann, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen Quoting Peter Chubb (peterc-M3ycANVxPotyL3EAZA59ERCuuivNXqWP@public.gmane.org): > >>>>> "Jeremy" == Jeremy Fitzhardinge <jeremy-TSDbQ3PG+2Y@public.gmane.org> writes: > > Jeremy> Dave Hansen wrote: > >> Arnd, Jeremy and Oren, > >> > > > Jeremy> * multiple processes * pipes * UNIX domain sockets * INET > Jeremy> sockets (both inter and intra machine) * unlinked open files * > Jeremy> checkpointing file content * closed files (ie, files which > Jeremy> aren't currently open, but will be soon, esp tmp files) * > Jeremy> shared memory * (Peter, what have I forgotten?) > > File sharing; multiple threads with wierd sharing arrangements (think: > clone with various parameters, followed by exec in some of the threads > but not others); MERT/system-V shared memory, semaphores and message > queues; devices (audio, framebuffer, etc), HugeTLBFS, numa issues > (pinning, memory layout), processes being debugged (so, > checkpoint.restart a gdb/target pair), futexes, etc., etc. Linux > process state keeps expanding. > > Jeremy> Having gone through this before, I don't think an all-kernel > Jeremy> solution can work except for the most simple cases. > > I agree ... it's better to put mechanisms into the kernel that can > then be used by a user-space programme to actually do the > checkpointing and restarting. > > Beefing up ptrace or fixing /proc to be a real debugging interface > would be a start ... when you can get at *all* the info you need, Except we don't really want to export all the info you need for a complete restartable checkpoint. And especially not make it generally writable. We have also started down that path using ptrace (see cryo, at git://git.sr71.net/~hallyn/cryodev.git). Right before the containers mini-summit, where the general agreement was that a complete in-kernel solution ought to be pursued, I had tried a restart using a binary format that read a checkpoint file and used cryo (userspace using ptrace) for the rest of the restart, only because there was no other reasonable way to set tsk->did_exec on restart. > quickly and easily, the userspace checkpoint falls out fairly > naturally. You still have to work out an extensible file format to > store stuff, and how to restore all that state you've so lovingly > collected. > > Jeremy> Lightweight filesystem checkpointing, such as btrfs provides, > Jeremy> would seem like a powerful mechanism for handling a lot of the > Jeremy> filesystem state problems. It would have been useful when we > Jeremy> did this... > > And how! saving bits of files was very timeconsuming. Yes, we're looking forward to using btrfs' snapshots :) -serge ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: checkpoint/restart ABI [not found] ` <87d4kfds5i.wl%peterc-LkDQP0DxSMGxwJ88Py/mJxCuuivNXqWP@public.gmane.org> 2008-08-12 14:49 ` Serge E. Hallyn @ 2008-08-12 15:11 ` Dave Hansen 1 sibling, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-12 15:11 UTC (permalink / raw) To: Peter Chubb Cc: Jeremy Fitzhardinge, Theodore Tso, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Arnd Bergmann, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Tue, 2008-08-12 at 09:54 +1000, Peter Chubb wrote: > Jeremy> * multiple processes * pipes * UNIX domain sockets * INET > Jeremy> sockets (both inter and intra machine) * unlinked open files * > Jeremy> checkpointing file content * closed files (ie, files which > Jeremy> aren't currently open, but will be soon, esp tmp files) * > Jeremy> shared memory * (Peter, what have I forgotten?) > > File sharing; multiple threads with wierd sharing arrangements (think: > clone with various parameters, followed by exec in some of the threads > but not others); MERT/system-V shared memory, semaphores and message > queues; devices (audio, framebuffer, etc), HugeTLBFS, numa issues > (pinning, memory layout), processes being debugged (so, > checkpoint.restart a gdb/target pair), futexes, etc., etc. Linux > process state keeps expanding. Yep, these are all challenges. If you have some really specific questions, or things you truly think can't be done, please speak up. But, I really don't see any show stoppers in your list. We also plan to do this incrementally. The first consumers are likely to be dumb, simple HPC apps that don't have real hardware like audio or video. Eventually, we'll get to real hardware like infiniband (ugh) or audio. Eventually. (Actually futexes aren't that bad because they don't keep state in-kernel) -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <20080812144905.GA16016@us.ibm.com>]
[parent not found: <20080812144905.GA16016-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: checkpoint/restart ABI [not found] ` <20080812144905.GA16016-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> @ 2008-08-28 23:40 ` Eric W. Biederman 0 siblings, 0 replies; 71+ messages in thread From: Eric W. Biederman @ 2008-08-28 23:40 UTC (permalink / raw) To: Serge E. Hallyn Cc: Jeremy Fitzhardinge, Theodore Tso, Arnd Bergmann, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen, Peter Chubb "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes: > Quoting Peter Chubb (peterc-M3ycANVxPotyL3EAZA59ERCuuivNXqWP@public.gmane.org): >> Beefing up ptrace or fixing /proc to be a real debugging interface >> would be a start ... when you can get at *all* the info you need, > > Except we don't really want to export all the info you need for a > complete restartable checkpoint. And especially not make it > generally writable. That and unless we get a lot of synergy from authors of debuggers and debugging code it is a more general and slower interface for no apparent gain. > We have also started down that path using ptrace (see cryo, at > git://git.sr71.net/~hallyn/cryodev.git). > > Right before the containers mini-summit, where the general agreement was > that a complete in-kernel solution ought to be pursued, I had tried > a restart using a binary format that read a checkpoint file and used > cryo (userspace using ptrace) for the rest of the restart, only > because there was no other reasonable way to set tsk->did_exec on > restart. Can we please describe this as the giant syscall approach. Instead of a complete in-kernel solution. There are things like filesystems that should be checkpointed separately, or not checkpointed at all. However there is a large set of processes and process state that always goes together and if you checkpoint a container you always want. So building something that is roughly equivalent to a binfmt module but that can save and restore multiple tasks with a single operation looks like the right granularity. >> Jeremy> Lightweight filesystem checkpointing, such as btrfs provides, >> Jeremy> would seem like a powerful mechanism for handling a lot of the >> Jeremy> filesystem state problems. It would have been useful when we >> Jeremy> did this... >> >> And how! saving bits of files was very timeconsuming. > > Yes, we're looking forward to using btrfs' snapshots :) Yep. And in the case of migration we don't even need to snapshot a filesystem just mount it from on the target machine. Except for the unlinked files challenge. Eric ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1218553091.5598.76.camel@nimitz>]
* Re: checkpoint/restart ABI [not found] ` <1218553091.5598.76.camel@nimitz> @ 2008-08-12 16:32 ` Jeremy Fitzhardinge [not found] ` <48A1BB39.3090108@goop.org> 1 sibling, 0 replies; 71+ messages in thread From: Jeremy Fitzhardinge @ 2008-08-12 16:32 UTC (permalink / raw) To: Dave Hansen Cc: Theodore Tso, Daniel Lezcano, Arnd Bergmann, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Peter Chubb Dave Hansen wrote: >> I'm more interested in seeing a description of how you're doing to >> handle things like: >> >> * multiple processes >> * pipes >> * UNIX domain sockets >> * INET sockets (both inter and intra machine) >> * unlinked open files >> * checkpointing file content >> * closed files (ie, files which aren't currently open, but will be >> soon, esp tmp files) >> * shared memory >> * (Peter, what have I forgotten?) >> >> Having gone through this before, I don't think an all-kernel solution >> can work except for the most simple cases. >> > > So, there's a lot of stuff there. The networking stuff is way out of my > league, so I'll cc Daniel and make him answer. :) > Inter-machine networking stuff is hard because its outside the checkpointed set, so the checkpoint is observable. Migration is easier, in principle, because you might be able to shift the connection endpoint without bringing it down. Dealing with networking within your checkpointed set is just fiddly, particularly remembering and restoring all the details of things like urgent messages, on-the-fly file descriptors, packet boundaries, etc. > Unlinked files, for instance, are actually available in /proc. You can > freeze the app, write a helper that opens /proc/1234/fd, then copies its > contents to a linked file (ooooh, with splice!) Anyway, if we can do it > in userspace, we can surely do it in the kernel. > Sure, there's no inherent problem. But do you imagine including the file contents within your checkpoint image, or would they be saved separately? > I'm not sure what you mean by "closed files". Either the app has a fd, > it doesn't, or it is in sys_open() somewhere. We have to get the app > into a quiescent state before we can checkpoint, so we basically just > say that we won't checkpoint things that are *in* the kernel. > It's common for an app to write a tmp file, close it, and then open it a bit later expecting to find the content it just wrote. If you checkpoint-kill it in the interim, reboot (clearing out /tmp) and then resume, then it will lose its tmp file. There's no explicit connection between the process and its potential working set of files. We had to deal with it by setting a bunch of policy files to tell the checkpoint/restart system what filename patterns it had to look out for. But if you just checkpoint the whole filesystem state along with the process(es), then perhaps it isn't an issue. > Is there anything specific you are thinking of that particularly worries > you? I could write pages on the list you have there. > No, that's the problem; it all worries me. It's a big problem space. >> Which, come to think of it, is an important point. What are the >> expected use-cases for this feature? Do you really mean >> checkpoint/restart? Do you expect to be able to checkpoint a process, >> leave it running, then "rewind" by restoring the image? Or does >> checkpoint always atomically kill the source process(es)? Are you >> expecting to be able to resume on another machine? >> > > Yes. > > We all want different things, and there are a lot of people interested > in this stuff. So, I think all of what you've mentioned above are > goals, at least long term. Some, *really* long term. > So, in other words: whoever wants to work on it gets to define (their) goals. Fair enough. > I don't want to get into a full virtualization vs. containers debate, > but we also want it for all the same reasons that you migrate Xen > partitions. > No, I don't have any real opinion about containers vs virtualization. I think they're quite distinct solutions for distinct problems. But I was involved in the design and implementation of a checkpoint-restart system (along with Peter Chubb), and have the scars to prove it. We implemented it for IRIX; we called it Hibernator, and licensed it to SGI for a while (I don't remember what name they marketed it under). The list of problems that Peter and I mentioned are ones we had to solve (or, in some cases, failed to solve) to get a workable system. J ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <48A1BB39.3090108@goop.org>]
[parent not found: <48A1BB39.3090108-TSDbQ3PG+2Y@public.gmane.org>]
* Re: checkpoint/restart ABI [not found] ` <48A1BB39.3090108-TSDbQ3PG+2Y@public.gmane.org> @ 2008-08-12 16:46 ` Dave Hansen 0 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-12 16:46 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Theodore Tso, Daniel Lezcano, Arnd Bergmann, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Peter Chubb On Tue, 2008-08-12 at 09:32 -0700, Jeremy Fitzhardinge wrote: > Inter-machine networking stuff is hard because its outside the > checkpointed set, so the checkpoint is observable. Migration is easier, > in principle, because you might be able to shift the connection endpoint > without bringing it down. Dealing with networking within your > checkpointed set is just fiddly, particularly remembering and restoring > all the details of things like urgent messages, on-the-fly file > descriptors, packet boundaries, etc. All true. Hard stuff. The IBM product works partly by limiting migrations to occurring on a single physical ethernet network. Each container gets its own IP and MAC address. The socket state is checkpointed quite fully and moved along with the IP. > > Unlinked files, for instance, are actually available in /proc. You can > > freeze the app, write a helper that opens /proc/1234/fd, then copies its > > contents to a linked file (ooooh, with splice!) Anyway, if we can do it > > in userspace, we can surely do it in the kernel. > > Sure, there's no inherent problem. But do you imagine including the > file contents within your checkpoint image, or would they be saved > separately? Me, personally, I think I'd probably "re-link" the thing, mark it as such, ship it across like a normal file, then unlink it after the restore. I don't know what we'd choose when actually implementing it. > > I'm not sure what you mean by "closed files". Either the app has a fd, > > it doesn't, or it is in sys_open() somewhere. We have to get the app > > into a quiescent state before we can checkpoint, so we basically just > > say that we won't checkpoint things that are *in* the kernel. > > It's common for an app to write a tmp file, close it, and then open it a > bit later expecting to find the content it just wrote. If you > checkpoint-kill it in the interim, reboot (clearing out /tmp) and then > resume, then it will lose its tmp file. There's no explicit connection > between the process and its potential working set of files. I respectfully disagree. The number one prerequisite for checkpoint/restart is isolation. Xen just happens to get this for free. So, instead of saying that there's no explicit connection between the process and its working set, ask yourself how we make a connection. In this case, we can do it with a filesystem (mount) namespace. Each container that we might want to checkpoint must have its writable filesystems contained to a private set that are not shared with other containers. Things like union mounts would help here, but aren't necessarily required. They just make it more efficient. > We had to > deal with it by setting a bunch of policy files to tell the > checkpoint/restart system what filename patterns it had to look out > for. But if you just checkpoint the whole filesystem state along with > the process(es), then perhaps it isn't an issue. Right. We just start with "everybody has their own disk" which is slow and crappy and optimize it from there. > > Is there anything specific you are thinking of that particularly worries > > you? I could write pages on the list you have there. > > No, that's the problem; it all worries me. It's a big problem space. It's almost as big of a problem as trying to virtualize entire machines and expecting them to run as fast as native. :) > > I don't want to get into a full virtualization vs. containers debate, > > but we also want it for all the same reasons that you migrate Xen > > partitions. > > > No, I don't have any real opinion about containers vs virtualization. I > think they're quite distinct solutions for distinct problems. > > But I was involved in the design and implementation of a > checkpoint-restart system (along with Peter Chubb), and have the scars > to prove it. We implemented it for IRIX; we called it Hibernator, and > licensed it to SGI for a while (I don't remember what name they marketed > it under). The list of problems that Peter and I mentioned are ones we > had to solve (or, in some cases, failed to solve) to get a workable system. Cool! I didn't know you guys did the IRIX implementation. I'm sure you guys got a lot farther than any of us are. Did you guys ever write any papers or anything on it? I'd be interested in more information. -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1218559619.5598.97.camel@nimitz>]
* Re: checkpoint/restart ABI [not found] ` <1218559619.5598.97.camel@nimitz> @ 2008-08-12 17:04 ` Jeremy Fitzhardinge [not found] ` <48A1C2B9.9070107@goop.org> ` (2 subsequent siblings) 3 siblings, 0 replies; 71+ messages in thread From: Jeremy Fitzhardinge @ 2008-08-12 17:04 UTC (permalink / raw) To: Dave Hansen Cc: Theodore Tso, Daniel Lezcano, Arnd Bergmann, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Peter Chubb Dave Hansen wrote: >>> I'm not sure what you mean by "closed files". Either the app has a fd, >>> it doesn't, or it is in sys_open() somewhere. We have to get the app >>> into a quiescent state before we can checkpoint, so we basically just >>> say that we won't checkpoint things that are *in* the kernel. >>> >> It's common for an app to write a tmp file, close it, and then open it a >> bit later expecting to find the content it just wrote. If you >> checkpoint-kill it in the interim, reboot (clearing out /tmp) and then >> resume, then it will lose its tmp file. There's no explicit connection >> between the process and its potential working set of files. >> > > I respectfully disagree. The number one prerequisite for > checkpoint/restart is isolation. Xen just happens to get this for free. > (I don't have my Xen hat on at all for this thread.) > So, instead of saying that there's no explicit connection between the > process and its working set, ask yourself how we make a connection. > > In this case, we can do it with a filesystem (mount) namespace. Each > container that we might want to checkpoint must have its writable > filesystems contained to a private set that are not shared with other > containers. Things like union mounts would help here, but aren't > necessarily required. They just make it more efficient. > We were dealing with checkpointing random sets of processes, and that posed all sorts of problems. Filesystem namespace was one, the pid namespace was another. Doing checkpointing at the container-level granularity definitely solves a lot of problems. >>> Is there anything specific you are thinking of that particularly worries >>> you? I could write pages on the list you have there. >>> >> No, that's the problem; it all worries me. It's a big problem space. >> > > It's almost as big of a problem as trying to virtualize entire machines > and expecting them to run as fast as native. :) > No, it's much harder. Hardware is relatively simple and immutable compared to kernel and process state ;) > Cool! I didn't know you guys did the IRIX implementation. I'm sure you > guys got a lot farther than any of us are. Did you guys ever write any > papers or anything on it? I'd be interested in more information. > Yeah, there was a paper, but it looks like the internet has lost it. It was at http://www.csu.edu.au/special/conference/apwww95/.papers95/cmaltby/cmaltby.ps http://www.csu.edu.au/special/conference/apwww95/sept-all.html has mention of the paper. J ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <48A1C2B9.9070107@goop.org>]
[parent not found: <48A1C2B9.9070107-TSDbQ3PG+2Y@public.gmane.org>]
* Re: checkpoint/restart ABI [not found] ` <48A1C2B9.9070107-TSDbQ3PG+2Y@public.gmane.org> @ 2008-08-20 21:52 ` Oren Laadan 0 siblings, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-20 21:52 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Theodore Tso, Daniel Lezcano, Arnd Bergmann, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen, Peter Chubb Jeremy Fitzhardinge wrote: > Dave Hansen wrote: >>>> I'm not sure what you mean by "closed files". Either the app has a fd, >>>> it doesn't, or it is in sys_open() somewhere. We have to get the app >>>> into a quiescent state before we can checkpoint, so we basically just >>>> say that we won't checkpoint things that are *in* the kernel. >>>> >>> It's common for an app to write a tmp file, close it, and then open it a >>> bit later expecting to find the content it just wrote. If you >>> checkpoint-kill it in the interim, reboot (clearing out /tmp) and then >>> resume, then it will lose its tmp file. There's no explicit connection >>> between the process and its potential working set of files. >>> >> I respectfully disagree. The number one prerequisite for >> checkpoint/restart is isolation. Xen just happens to get this for free. >> > > (I don't have my Xen hat on at all for this thread.) > >> So, instead of saying that there's no explicit connection between the >> process and its working set, ask yourself how we make a connection. >> >> In this case, we can do it with a filesystem (mount) namespace. Each >> container that we might want to checkpoint must have its writable >> filesystems contained to a private set that are not shared with other >> containers. Things like union mounts would help here, but aren't >> necessarily required. They just make it more efficient. >> > > We were dealing with checkpointing random sets of processes, and that > posed all sorts of problems. Filesystem namespace was one, the pid > namespace was another. Doing checkpointing at the container-level > granularity definitely solves a lot of problems. > >>>> Is there anything specific you are thinking of that particularly worries >>>> you? I could write pages on the list you have there. >>>> >>> No, that's the problem; it all worries me. It's a big problem space. >>> >> It's almost as big of a problem as trying to virtualize entire machines >> and expecting them to run as fast as native. :) >> > > No, it's much harder. Hardware is relatively simple and immutable > compared to kernel and process state ;) > >> Cool! I didn't know you guys did the IRIX implementation. I'm sure you >> guys got a lot farther than any of us are. Did you guys ever write any >> papers or anything on it? I'd be interested in more information. >> > > Yeah, there was a paper, but it looks like the internet has lost it. It > was at > http://www.csu.edu.au/special/conference/apwww95/.papers95/cmaltby/cmaltby.ps > http://www.csu.edu.au/special/conference/apwww95/sept-all.html has > mention of the paper. > you can find it here: http://ertos.nicta.com.au/publications/papers/Maltby_Chubb_95.pdf Oren. ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: checkpoint/restart ABI [not found] ` <1218559619.5598.97.camel@nimitz> 2008-08-12 17:04 ` Jeremy Fitzhardinge [not found] ` <48A1C2B9.9070107@goop.org> @ 2008-08-20 21:54 ` Oren Laadan [not found] ` <48AC929C.9030901@cs.columbia.edu> 3 siblings, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-20 21:54 UTC (permalink / raw) To: Dave Hansen Cc: Jeremy Fitzhardinge, Theodore Tso, Daniel Lezcano, Arnd Bergmann, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Peter Chubb Dave Hansen wrote: > On Tue, 2008-08-12 at 09:32 -0700, Jeremy Fitzhardinge wrote: >> Inter-machine networking stuff is hard because its outside the >> checkpointed set, so the checkpoint is observable. Migration is easier, >> in principle, because you might be able to shift the connection endpoint >> without bringing it down. Dealing with networking within your >> checkpointed set is just fiddly, particularly remembering and restoring >> all the details of things like urgent messages, on-the-fly file >> descriptors, packet boundaries, etc. > > All true. Hard stuff. > > The IBM product works partly by limiting migrations to occurring on a > single physical ethernet network. Each container gets its own IP and > MAC address. The socket state is checkpointed quite fully and moved > along with the IP. > >>> Unlinked files, for instance, are actually available in /proc. You can >>> freeze the app, write a helper that opens /proc/1234/fd, then copies its >>> contents to a linked file (ooooh, with splice!) Anyway, if we can do it >>> in userspace, we can surely do it in the kernel. >> Sure, there's no inherent problem. But do you imagine including the >> file contents within your checkpoint image, or would they be saved >> separately? > > Me, personally, I think I'd probably "re-link" the thing, mark it as > such, ship it across like a normal file, then unlink it after the > restore. I don't know what we'd choose when actually implementing it. Re-linking works well when the file system supports that - some do not allow this, in which case you need to silently rename instead of really un-linking (even with NFS), or copy the entire contents. Of course, you also need a snapshot of the file system in case it changes after the checkpoint is taken, or take other measures. We can safely defer addressing this for later. > >>> I'm not sure what you mean by "closed files". Either the app has a fd, >>> it doesn't, or it is in sys_open() somewhere. We have to get the app >>> into a quiescent state before we can checkpoint, so we basically just >>> say that we won't checkpoint things that are *in* the kernel. >> It's common for an app to write a tmp file, close it, and then open it a >> bit later expecting to find the content it just wrote. If you >> checkpoint-kill it in the interim, reboot (clearing out /tmp) and then >> resume, then it will lose its tmp file. There's no explicit connection >> between the process and its potential working set of files. > > I respectfully disagree. The number one prerequisite for > checkpoint/restart is isolation. Xen just happens to get this for free. > So, instead of saying that there's no explicit connection between the > process and its working set, ask yourself how we make a connection. > > In this case, we can do it with a filesystem (mount) namespace. Each > container that we might want to checkpoint must have its writable > filesystems contained to a private set that are not shared with other > containers. Things like union mounts would help here, but aren't > necessarily required. They just make it more efficient. > >> We had to >> deal with it by setting a bunch of policy files to tell the >> checkpoint/restart system what filename patterns it had to look out >> for. But if you just checkpoint the whole filesystem state along with >> the process(es), then perhaps it isn't an issue. > > Right. We just start with "everybody has their own disk" which is slow > and crappy and optimize it from there. Yep. [SNIP] Oren. ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <48AC929C.9030901@cs.columbia.edu>]
[parent not found: <48AC929C.9030901-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: checkpoint/restart ABI [not found] ` <48AC929C.9030901-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2008-08-20 22:11 ` Dave Hansen 0 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-20 22:11 UTC (permalink / raw) To: Oren Laadan Cc: Jeremy Fitzhardinge, Theodore Tso, Daniel Lezcano, Arnd Bergmann, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Peter Chubb On Wed, 2008-08-20 at 17:54 -0400, Oren Laadan wrote: > > Me, personally, I think I'd probably "re-link" the thing, mark it as > > such, ship it across like a normal file, then unlink it after the > > restore. I don't know what we'd choose when actually implementing > it. > > Re-linking works well when the file system supports that - some do not > allow this, in which case you need to silently rename instead of really > un-linking (even with NFS), or copy the entire contents. Yeah, it will certainly be fs-dependent. This might be a good application for splice. open("/tmp/linked-newfile", O_RDONLY, perms); splice(unlinked_fd, NULL, new_fd, NULL, MAX_INT, SPLICE_F_MOVE); I'm not sure if it can re-use the blocks on the fs for this, but it probably doesn't matter. -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <200808112347.50245.arnd@arndb.de>]
[parent not found: <20080811171433.2ce81f28@bike.lwn.net>]
[parent not found: <20080811171433.2ce81f28-vw3g6Xz/EtPk1uMJSBkQmQ@public.gmane.org>]
* Re: checkpoint/restart ABI [not found] ` <20080811171433.2ce81f28-vw3g6Xz/EtPk1uMJSBkQmQ@public.gmane.org> @ 2008-08-11 23:23 ` Dave Hansen 0 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-11 23:23 UTC (permalink / raw) To: Jonathan Corbet Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann On Mon, 2008-08-11 at 17:14 -0600, Jonathan Corbet wrote: > On Mon, 11 Aug 2008 23:47:49 +0200 > Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> wrote: > > > The other problem that you really need to solve is interface > > stability. What you are creating is a binary representation > > of many kernel internal data structures, so in our common > > rules, you have to make sure that you remain forward and > > backward compatible. Simply saying that you need to run > > an identical kernel when restarting from a checkpoint is not > > enough IMHO. > > OTOH, making one of these checkpoint files go into any 2.6.x kernel > seems like a very high bar, to the point, perhaps, of killing this > feature entirely. The OpenVZ dudes like refer to something that Andrew Morton said about this (paraphrasing...): if we need cross-version restore support, we can count on userspace to do the conversion. You can almost think of it like the crashdump processing utility that we have. Instead of worrying about having the kernel *always* produce the same crashdump with the same gunk in it, we make userspace do all the parsing and interpretation. It also makes it quite possible for a distribution to make a change (say because of a security fix) in the kernel that changes the checkpoint format, then to quickly code up the necessary bits for the conversion program. -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <200808112347.50245.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: checkpoint/restart ABI [not found] ` <200808112347.50245.arnd-r2nGTMty4D4@public.gmane.org> @ 2008-08-11 23:14 ` Jonathan Corbet 2008-08-21 5:56 ` Oren Laadan 1 sibling, 0 replies; 71+ messages in thread From: Jonathan Corbet @ 2008-08-11 23:14 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen On Mon, 11 Aug 2008 23:47:49 +0200 Arnd Bergmann <arnd-r2nGTMty4D4@public.gmane.org> wrote: > The other problem that you really need to solve is interface > stability. What you are creating is a binary representation > of many kernel internal data structures, so in our common > rules, you have to make sure that you remain forward and > backward compatible. Simply saying that you need to run > an identical kernel when restarting from a checkpoint is not > enough IMHO. OTOH, making one of these checkpoint files go into any 2.6.x kernel seems like a very high bar, to the point, perhaps, of killing this feature entirely. There could be a case for viewing sys_restore() as being a lot like sys_init_module() - a view into kernel internals that goes beyond the normal user-space ABI, and beyond the stability guarantee. It might be possible to create a certain amount of version portability with a modversions-like mechanism, but it sure seems hard to do better than that. jon ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: checkpoint/restart ABI [not found] ` <200808112347.50245.arnd-r2nGTMty4D4@public.gmane.org> 2008-08-11 23:14 ` Jonathan Corbet @ 2008-08-21 5:56 ` Oren Laadan [not found] ` <48AD0379.9030705-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> [not found] ` <200808211043.41387.arnd@arndb.de> 1 sibling, 2 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-21 5:56 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen Arnd Bergmann wrote: > On Monday 11 August 2008, Dave Hansen wrote: >> Thanks for all of the very interesting comments about the ABI. >> >> Considering that we're still *really* early in getting this concept >> merged up into mainline, what do you all think we should do now? > > I think the two most important aspects here need to be security and > simplicity. If you have to choose between the two, it probably makes > sense to put security first, because loading untrusted data into > the kernel puts you at a significant risk to start with. If you > can show a restart interface that lets regular users restart their > tasks in a way anyone can verify to be secure, that will be a > good indication that you're on the right track. > > The other problem that you really need to solve is interface > stability. What you are creating is a binary representation > of many kernel internal data structures, so in our common > rules, you have to make sure that you remain forward and > backward compatible. Simply saying that you need to run > an identical kernel when restarting from a checkpoint is not > enough IMHO. > quoting: > There could be a case for viewing sys_restore() as being a lot like > sys_init_module() - a view into kernel internals that goes beyond the > normal user-space ABI, and beyond the stability guarantee. It might be > possible to create a certain amount of version portability with a > modversions-like mechanism, but it sure seems hard to do better than > that. > > jon Extending this view in the context of security - we can require sysadmin privilege to restart, and then sysadmin is responsible for the contents of the file. The kernel will ensure the the data isn't corrupted. Much like with loading a kenrel module - the admin may load any sort of crap. Then, sysadmin may, for instance, add a signature on a checkpointed file to verify it's integrity. (Well, one problem with this scheme in the context of self-checkpoint would be - who can be trusted to generate the signature in that case). > Some more words on specific interfaces that we have discussed: > > The single-file-descriptor approach has the big advantage of > keeping the complexity in one place (the kernel). To be consistent > with other kernel interfaces, I would make the kernel hand out a > file descriptor, not let the user open a file and pass that into > the kernel as you do now. > > A new file system is a good idea for many complex interfaces that > make their way into the kernel, but I don't think it will help > in this case. > > For checkpointing a single task, or even a task with its children, > a different interface I could imagine would be to have a new > file in procfs per pid that you can read as a pipe giving our > the same data that you currently save in the checkpoint file > descriptor. It does mean that you won't be able to pass flags > down easily (you could write to the pipe before you start reading, > but that's not too nice). Using a single handle (crid or a special file descriptor) to identify the whole checkpoint is very useful - to be able to stream it (eg. over the network, or through filters). It is also very important for future features and optimizations. For example, to reduce downtime of the application during checkpoint, one can use COW for dirty pages, and only write-back the entire data after the application resumes execution. Or imagine a use-case where one would like to keep the entire checkpoint in memory. These are pretty hard to do if you split the handling between multiple files or handles. > > On the restart side, I think the most consistent interface would > be a new binfmt_chkpt implementation that you can use to execve > a checkpoint, just like you execute an ELF file today. The binfmt > can be a module (unlike a syscall), so an administrator that is > afraid of the security implications can just disable it by not > loading the module. In an execve model, the parent process can > set up anything related to credentials as good as it's allowed > to and then let the kernel do the rest. This is an interesting idea but not without its problems. In particular, a successful execve() by one thread destroys all the others. Also, it isn't clear how this can work with pre-copying and live-migration; And finally, I'm not sure how to handle shared objects in this manner. As for kernel module - it is easy to implement most of the checkpoint restart functionality in a kernel module, leaving only the syscall stubs in the kernel. Oren. ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <48AD0379.9030705-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: checkpoint/restart ABI [not found] ` <48AD0379.9030705-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2008-08-21 8:43 ` Arnd Bergmann 0 siblings, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-21 8:43 UTC (permalink / raw) To: Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen On Thursday 21 August 2008, Oren Laadan wrote: > > Arnd Bergmann wrote: > Extending this view in the context of security - we can require sysadmin > privilege to restart, and then sysadmin is responsible for the contents > of the file. The kernel will ensure the the data isn't corrupted. Much > like with loading a kenrel module - the admin may load any sort of crap. > Then, sysadmin may, for instance, add a signature on a checkpointed file > to verify it's integrity. > > (Well, one problem with this scheme in the context of self-checkpoint > would be - who can be trusted to generate the signature in that case). Sorry, I don't buy that argument. I'm convinced that an implementation is possible where any user can load checkpoints of tasks that he could create by starting the processes directly. If you argue that loading a corrupted checkpoint can cause any problems, then I would assume the restart code needs better permission and sanity checks. > Using a single handle (crid or a special file descriptor) to identify > the whole checkpoint is very useful - to be able to stream it (eg. over > the network, or through filters). It is also very important for future > features and optimizations. For example, to reduce downtime of the > application during checkpoint, one can use COW for dirty pages, and > only write-back the entire data after the application resumes execution. > Or imagine a use-case where one would like to keep the entire checkpoint > in memory. These are pretty hard to do if you split the handling between > multiple files or handles. right. > > On the restart side, I think the most consistent interface would > > be a new binfmt_chkpt implementation that you can use to execve > > a checkpoint, just like you execute an ELF file today. The binfmt > > can be a module (unlike a syscall), so an administrator that is > > afraid of the security implications can just disable it by not > > loading the module. In an execve model, the parent process can > > set up anything related to credentials as good as it's allowed > > to and then let the kernel do the rest. > > This is an interesting idea but not without its problems. In particular, > a successful execve() by one thread destroys all the others. Right, execve currently assumes that the new process starts up with a single thread, but a potential binfmt_chkpt would need to potentially start multithreaded. I guess this either requires execve to reuse the existing threads (assuming they have been set up correctly in advance) or to create new ones according to the context of the checkpoint data. It may not be as easy as I thought initially, but both seem possible. Restarting a whole set of processes from a checkpoint would be a relatively simple extension of that. > Also, it isn't clear how this can work with pre-copying and live-migration; > And finally, I'm not sure how to handle shared objects in this manner. What do you mean with pre-copying? How is live-migration different from restarting a previously saved task from the same machine? > As for kernel module - it is easy to implement most of the checkpoint > restart functionality in a kernel module, leaving only the syscall stubs > in the kernel. Yeah, I've done the same in spufs, but I still think it's ugly ;-) Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <200808211043.41387.arnd@arndb.de>]
[parent not found: <200808211043.41387.arnd-r2nGTMty4D4@public.gmane.org>]
* Re: checkpoint/restart ABI [not found] ` <200808211043.41387.arnd-r2nGTMty4D4@public.gmane.org> @ 2008-08-21 15:43 ` Oren Laadan 0 siblings, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-21 15:43 UTC (permalink / raw) To: Arnd Bergmann Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen Arnd Bergmann wrote: > On Thursday 21 August 2008, Oren Laadan wrote: > >> Using a single handle (crid or a special file descriptor) to identify >> the whole checkpoint is very useful - to be able to stream it (eg. over >> the network, or through filters). It is also very important for future >> features and optimizations. For example, to reduce downtime of the >> application during checkpoint, one can use COW for dirty pages, and >> only write-back the entire data after the application resumes execution. >> Or imagine a use-case where one would like to keep the entire checkpoint >> in memory. These are pretty hard to do if you split the handling between >> multiple files or handles. > > right. > >>> On the restart side, I think the most consistent interface would >>> be a new binfmt_chkpt implementation that you can use to execve >>> a checkpoint, just like you execute an ELF file today. The binfmt >>> can be a module (unlike a syscall), so an administrator that is >>> afraid of the security implications can just disable it by not >>> loading the module. In an execve model, the parent process can >>> set up anything related to credentials as good as it's allowed >>> to and then let the kernel do the rest. >> This is an interesting idea but not without its problems. In particular, >> a successful execve() by one thread destroys all the others. > > Right, execve currently assumes that the new process starts up with > a single thread, but a potential binfmt_chkpt would need to potentially > start multithreaded. I guess this either requires execve to reuse > the existing threads (assuming they have been set up correctly in > advance) or to create new ones according to the context of the > checkpoint data. It may not be as easy as I thought initially, but > both seem possible. > Restarting a whole set of processes from a checkpoint would be > a relatively simple extension of that. > >> Also, it isn't clear how this can work with pre-copying and live-migration; >> And finally, I'm not sure how to handle shared objects in this manner. > > What do you mean with pre-copying? > How is live-migration different from restarting a previously saved > task from the same machine? By pre-copying I refer to the first stage of live-migration: to reduce down time, much of the state of a container can be saved while tasks are still running (most notably memory, but also file system snapshot, if need be). Since the state may change, this is repeated - to save the what changed in the meanwhile - until the delta is small enough. During all this time the tasks continue to execute. At this point, we freeze the container, save the last delta, and resume (in case of snapshot) or or kill (in case of live-migration) the container. I'm not convinced that execve() is the best way to handle this iterative process. Also, with multiple tasks in a container, data for consecutive tasks will appear in order in the checkpoint image. Moreover, a future optimization would be the have multiple threads checkpoint the container, with data interleaved in the checkpoint image stream. Here, too, I'm not sure how execve()-like approach plays. Finally there is the case of shared objects: v2 demonstrates this in checkpoint/objhash.c (see also Documentation/checkpoint.txt). Again, I'm not sure how execve() can adapt to this need. I definitely agree that using something like execve() is elegant and has its advantages. It just isn't clear to me that it is truly suitable for the needs. Suggestions are welcome. Oren. > >> As for kernel module - it is easy to implement most of the checkpoint >> restart functionality in a kernel module, leaving only the syscall stubs >> in the kernel. > > Yeah, I've done the same in spufs, but I still think it's ugly ;-) > > Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <489CB3CA.6050304@cs.columbia.edu>]
[parent not found: <1218233855.19082.52.camel@nimitz>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <1218233855.19082.52.camel@nimitz> @ 2008-08-08 23:27 ` Oren Laadan 0 siblings, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-08 23:27 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann Dave Hansen wrote: > On Fri, 2008-08-08 at 16:59 -0400, Oren Laadan wrote: >> To avoid repetitive malloc/free, ctx->hbuf is a buffer to host headers >> as they are read; > > kmalloc/kfree() area really, really fast. I wonder if the code gets > easier or harder to read if we just alloc/free as we need to. The ctx->hbuf interface is a pair of cr_hbuf_get(ctx, length) and a matching cr_hbuf_put(ctx, length), almost like using kmalloc/kfree(). The main difference is that cleanup in error paths is implicit (the whole buffer is freed when the ctx is deallocated). > How large are these allocations, usually? Will stack allocation work in > most cases? That depends on how we construct the headers. In Zap there are some headers that use relatively long structures to be put on the stack, and it wouldn't make much sense to divide them into smaller headers artificially. However, I forgot to mention earlier that an important reason to use this construct is actually in anticipation for a future optimization: during application downtime the checkpoint state will be aggregated into an in-memory buffer, and only after the application is allowed to continue execution (unfrozen) the buffer will be written-back to the FD. In that scenario, we will allocate a larger buffer in the ctx (eg based on some heuristics) and cr_hbuf_get() will return the next location in that buffer, while cr_hbuf_put() will do nothing. Oren. ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <489CB3CA.6050304-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <489CB3CA.6050304-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> @ 2008-08-08 22:17 ` Dave Hansen 2008-08-08 22:23 ` Arnd Bergmann 2008-08-14 8:09 ` [Devel] " Pavel Emelyanov 2 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-08 22:17 UTC (permalink / raw) To: Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann On Fri, 2008-08-08 at 16:59 -0400, Oren Laadan wrote: > To avoid repetitive malloc/free, ctx->hbuf is a buffer to host headers > as they are read; kmalloc/kfree() area really, really fast. I wonder if the code gets easier or harder to read if we just alloc/free as we need to. How large are these allocations, usually? Will stack allocation work in most cases? -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <489CB3CA.6050304-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 2008-08-08 22:17 ` Dave Hansen @ 2008-08-08 22:23 ` Arnd Bergmann 2008-08-14 8:09 ` [Devel] " Pavel Emelyanov 2 siblings, 0 replies; 71+ messages in thread From: Arnd Bergmann @ 2008-08-08 22:23 UTC (permalink / raw) To: Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen On Friday 08 August 2008, Oren Laadan wrote: > > Yeah, this is very true. My guess is that we'll need something like > > what we do with modversions. > > Exactly. The header should eventually contain sufficient information > to describe the kernel version, configuration, compiler, cpu (arch and > capabilities), and checkpoint code version. > > How would you suggest to identify the origin tree with an identifier > (or a text field) in the header ? Including struct utsname in the header covers most of this. I supposed you can't do it entirely safe, and you always need to be prepared for malicious input data, so there probably isn't much point in getting too fancy beyond what you need to do to prevent user errors. > If it is preferred, we can change this to write a kernel message and > return a special error telling that a logical error has occurred. My recommendation in general is to make kernel code crash loudly if there is a bug in the kernel itself. Returning error codes makes most sense if they get sent back to the user, which then can make sense of it, as documented in the man page of the syscall. > > Yes, eventually. I think one good point is that we should probably > > remove this now so that we *have* to think about security implications > > as we add each individual patch. For instance, what kind of checking do > > we do when we restore an mlock()'d VMA? > > > > I'll pull this check out so it causes pain. (the good kind) > > Hmmm... even if not strictly now, we *will* need admin privileges for > the CR operations, for the following reasons: > > checkpoint: we save the entire state of a set of processes to a file - so > we must have privileges to do so, at least within (or with respect to) the > said container. Even if we are the user who owns the container, we'll need > root access within that container. > > restart: we restore the entire set of a set of processes, which may require > some privileged operations (again, at least within or with respect to the > said container). Otherwise any user could inject any restart data into the > kernel and create any set of processes with arbitrary permissions. > > In a sense, we need something similar to granting ptrace access. Exactly. There was a project that implemented checkpoint/restart through ptrace (don't remember what it was called), so with certain limitations it should also be possible to implement the syscalls so that any user that can ptrace the tasks can also checkpoint them. Arnd <>< ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <489CB3CA.6050304-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> 2008-08-08 22:17 ` Dave Hansen 2008-08-08 22:23 ` Arnd Bergmann @ 2008-08-14 8:09 ` Pavel Emelyanov 2 siblings, 0 replies; 71+ messages in thread From: Pavel Emelyanov @ 2008-08-14 8:09 UTC (permalink / raw) To: Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann, Dave Hansen Oren Laadan wrote: > > Dave Hansen wrote: >> On Fri, 2008-08-08 at 11:46 +0200, Arnd Bergmann wrote: >>> On Friday 08 August 2008, Dave Hansen wrote: >>>> + hh->magic = 0x00a2d200; >>>> + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff; >>>> + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff; >>>> + hh->patch = (LINUX_VERSION_CODE) & 0xff; >> ... >>>> +} >>> Do you rely on the kernel version in order to determine the format >>> of the binary data, or is it just informational? >>> >>> If you think the format can change in incompatible ways, you >>> probably need something more specific than the version number >>> these days, because there are just so many different trees with >>> the same numbers. >> Yeah, this is very true. My guess is that we'll need something like >> what we do with modversions. > > Exactly. The header should eventually contain sufficient information > to describe the kernel version, configuration, compiler, cpu (arch and > capabilities), and checkpoint code version. Sorry for late response - I was on vacation till Wednesday. I am opposed against having *explicit* information about the kernel configuration inside the image. I see this like we store object images in a file, and these images do *not* change depending on the .config. But instead of this, at the time of restore we may *only* detect whether we can restore this type of object or not. E.g. consider we are saving a container image on ipv6 node and trying to restore from it on the one without the ipv6. In that case we *may* have some object of for example CKPT_IPV6_IFA type of CLPT_IPV6_SOCK_INFO and fail the restoration process when finding such in an input file. But what we should *not* do is to write any information about whether we had the CONFIG_IPV6 turned on on the dumping side and check for this on the restoring side. (Sorry, if this question is already settled, but the discussion thread built by my mailer is a bit messy, so I suspect I could miss some part of the threads) > How would you suggest to identify the origin tree with an identifier > (or a text field) in the header ? ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <48A3E81C.6010008@openvz.org>]
[parent not found: <48A3E81C.6010008-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* Re: [Devel] Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <48A3E81C.6010008-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-08-14 15:16 ` Dave Hansen 0 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-14 15:16 UTC (permalink / raw) To: Pavel Emelyanov Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann On Thu, 2008-08-14 at 12:09 +0400, Pavel Emelyanov wrote: > E.g. consider we are saving a container image on ipv6 node and trying > to restore from it on the one without the ipv6. In that case we *may* > have some object of for example CKPT_IPV6_IFA type of CLPT_IPV6_SOCK_INFO > and fail the restoration process when finding such in an input file. But > what we should *not* do is to write any information about whether we had > the CONFIG_IPV6 turned on on the dumping side and check for this on the > restoring side. The only problem I can see with this is that you lose efficiency, especially when you have to build your checkpoint image with lots of things that are config-specific. The approach sounds like a good one in theory, but I'm a bit skeptical that we could stick to it in practice, in a mainline kernel where there are billions of config options. It is definitely something to strive for, though. Good point! -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <20080807224034.735B1F84@kernel> 2008-08-08 9:46 ` [RFC][PATCH 1/4] checkpoint-restart: general infrastructure Arnd Bergmann [not found] ` <200808081146.54834.arnd@arndb.de> @ 2008-08-11 18:03 ` Jonathan Corbet [not found] ` <20080811120315.4b3ba2c8@bike.lwn.net> 2008-08-18 9:26 ` [Devel] " Pavel Emelyanov 4 siblings, 0 replies; 71+ messages in thread From: Jonathan Corbet @ 2008-08-11 18:03 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA I'm trying to figure out this patch set...here's a few things which have caught my eye in passing. > +/** > + * cr_get_fname - return pathname of a given file > + * @file: file pointer > + * @buf: buffer for pathname > + * @n: buffer length (in) and pathname length (out) > + * > + * if the buffer provivded by the caller is too small, allocate a new > + * buffer; caller should call cr_put_pathname() for cleanup > + */ > +char *cr_get_fname(struct path *path, struct path *root, char *buf, > int *n) +{ > + char *fname; > + > + fname = __d_path(path, root, buf, *n); > + > + if (IS_ERR(fname) && PTR_ERR(fname) == -ENAMETOOLONG) { > + if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0))) This seems like a clunky and error-prone interface - why not just have it allocate the memory always? But, in this case, cr_get_fname() always seems to be called with ctx->tbuf, which, in turn, is an order-1 allocation. Here you're saying that if it's too small, you'll try replacing it with an order-0 allocation instead. I rather suspect that's not going to help. > +/* write the checkpoint header */ > +static int cr_write_hdr(struct cr_ctx *ctx) > +{ > + struct cr_hdr h; > + struct cr_hdr_head *hh = ctx->tbuf; > + struct timeval ktv; > + > + h.type = CR_HDR_HEAD; > + h.len = sizeof(*hh); > + h.id = 0; > + > + do_gettimeofday(&ktv); > + > + hh->magic = 0x00a2d200; This magic number is hard-coded in a number of places. Could it maybe benefit from a macro, which, in turn, could maybe end up in linux/magic.h? > +/* dump the task_struct of a given task */ > +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t) This function is going to break every time somebody changes struct task_struct. I'm not quite sure how to prevent that. I wonder if the modversions stuff could somehow be employed to detect changes and make the build fail? > +/** > + * sys_checkpoint - checkpoint a container > + * @pid: pid of the container init(1) process > + * @fd: file to which dump the checkpoint image > + * @flags: checkpoint operation flags > + */ > +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long > flags) +{ > + struct cr_ctx *ctx; > + struct file *file; > + int fput_needed; > + int ret; > + > + if (!capable(CAP_SYS_ADMIN)) > + return -EPERM; Like others, I wondered why CAP_SYS_ADMIN was required here. I *still* wonder, though, how you'll ever be able to do restart without a privilege check. There must be a thousand ways to compromise a system by messing with the checkpoint file. > + file = fget_light(fd, &fput_needed); > + if (!file) > + return -EBADF; Should you maybe check for write access? An attempt to overwrite a read-only file won't succeed, but you could save a lot of work by just failing it with a clear code here. What about the file position? Perhaps there could be a good reason to checkpoint a process into the middle of a file, don't know. In general, I don't see a whole lot of locking going on. Is it really possible to save and restore memory without ever holding mmap_sem? jon ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <20080811120315.4b3ba2c8@bike.lwn.net>]
[parent not found: <20080811120315.4b3ba2c8-vw3g6Xz/EtPk1uMJSBkQmQ@public.gmane.org>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <20080811120315.4b3ba2c8-vw3g6Xz/EtPk1uMJSBkQmQ@public.gmane.org> @ 2008-08-11 18:38 ` Dave Hansen 0 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-11 18:38 UTC (permalink / raw) To: Jonathan Corbet Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA On Mon, 2008-08-11 at 12:03 -0600, Jonathan Corbet wrote: > I'm trying to figure out this patch set...here's a few things which > have caught my eye in passing. > > > +/** > > + * cr_get_fname - return pathname of a given file > > + * @file: file pointer > > + * @buf: buffer for pathname > > + * @n: buffer length (in) and pathname length (out) > > + * > > + * if the buffer provivded by the caller is too small, allocate a new > > + * buffer; caller should call cr_put_pathname() for cleanup > > + */ > > +char *cr_get_fname(struct path *path, struct path *root, char *buf, > > int *n) +{ > > + char *fname; > > + > > + fname = __d_path(path, root, buf, *n); > > + > > + if (IS_ERR(fname) && PTR_ERR(fname) == -ENAMETOOLONG) { > > + if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0))) > > This seems like a clunky and error-prone interface - why not just have it > allocate the memory always? But, in this case, cr_get_fname() always seems > to be called with ctx->tbuf, which, in turn, is an order-1 allocation. > Here you're saying that if it's too small, you'll try replacing it with an > order-0 allocation instead. I rather suspect that's not going to help. Yeah, it doesn't make much sense on the surface. I would imagine that this has some use for when we're stacking things up in the ctx->hbuf rather than just using it as a completely temporary buffer. But, in any case, it doesn't make sense as it stands now, so I think it needs to be reworked. > > +/* write the checkpoint header */ > > +static int cr_write_hdr(struct cr_ctx *ctx) > > +{ > > + struct cr_hdr h; > > + struct cr_hdr_head *hh = ctx->tbuf; > > + struct timeval ktv; > > + > > + h.type = CR_HDR_HEAD; > > + h.len = sizeof(*hh); > > + h.id = 0; > > + > > + do_gettimeofday(&ktv); > > + > > + hh->magic = 0x00a2d200; > > This magic number is hard-coded in a number of places. Could it maybe > benefit from a macro, which, in turn, could maybe end up in linux/magic.h? > > > +/* dump the task_struct of a given task */ > > +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t) > > This function is going to break every time somebody changes struct > task_struct. I'm not quite sure how to prevent that. I wonder if the > modversions stuff could somehow be employed to detect changes and make the > build fail? In general, I think any time that we are checkpointing $THING and $THING changes, the checkpoint will break. It just so happens that all we're checkpointing here is the task_struct, so $THING == task_struct for now. :) The things that *really* worry me are things like when flags change semantics subtly. Or, let's say a flag is used for two different things in 2.6.26.4 vs 2.6.27. I'm not sure we're ever going to be in a position to find and fix up stuff like that. That's one reason I have been advocating doing checkpoint/restart in much tinier bits so that we can understand each of them as we go along. I also think that the *less* we expose to userspace, the better. > > +/** > > + * sys_checkpoint - checkpoint a container > > + * @pid: pid of the container init(1) process > > + * @fd: file to which dump the checkpoint image > > + * @flags: checkpoint operation flags > > + */ > > +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long > > flags) +{ > > + struct cr_ctx *ctx; > > + struct file *file; > > + int fput_needed; > > + int ret; > > + > > + if (!capable(CAP_SYS_ADMIN)) > > + return -EPERM; > > Like others, I wondered why CAP_SYS_ADMIN was required here. I *still* > wonder, though, how you'll ever be able to do restart without a privilege > check. There must be a thousand ways to compromise a system by messing > with the checkpoint file. As with everything else coming from userspace, the checkpoint file should be completely untrusted. I do think, though, that the ptrace analogy that Serge?? made is a good one. > > + file = fget_light(fd, &fput_needed); > > + if (!file) > > + return -EBADF; > > Should you maybe check for write access? An attempt to overwrite a > read-only file won't succeed, but you could save a lot of work by just > failing it with a clear code here. That's true. I'll take a look and see. This patch does reach down and use vfs_write() at some point. There really aren't any other in-kernel users that do this (short of ecryptfs and plan9fs). That makes me doubt that we're even using a good approach here. > What about the file position? Perhaps there could be a good reason to > checkpoint a process into the middle of a file, don't know. I think this is a good example of a place where the kernel can let userspace shoot itself in its foot if it wants. We might also want to allow things to be sent over fds that don't necessarily have positions, like sockets or pipes. > In general, I don't see a whole lot of locking going on. Is it really > possible to save and restore memory without ever holding mmap_sem? I personally haven't audited the locking, yet. It is going to be fun! But, take a look in patch 3/4: + /* write the vma's */ + down_read(&mm->mmap_sem); + for (vma = mm->mmap; vma; vma = vma->vm_next) { + if ((ret = cr_write_vma(ctx, vma)) < 0) + break; + } + up_read(&mm->mmap_sem); Thanks for the review, Jonathan! -- Dave _______________________________________________ Containers mailing list Containers@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <1218479921.5598.35.camel@nimitz>]
* Re: [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <1218479921.5598.35.camel@nimitz> @ 2008-08-12 3:44 ` Oren Laadan 0 siblings, 0 replies; 71+ messages in thread From: Oren Laadan @ 2008-08-12 3:44 UTC (permalink / raw) To: Dave Hansen Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Jonathan Corbet Dave Hansen wrote: > On Mon, 2008-08-11 at 12:03 -0600, Jonathan Corbet wrote: >> I'm trying to figure out this patch set...here's a few things which >> have caught my eye in passing. >> >>> +/** >>> + * cr_get_fname - return pathname of a given file >>> + * @file: file pointer >>> + * @buf: buffer for pathname >>> + * @n: buffer length (in) and pathname length (out) >>> + * >>> + * if the buffer provivded by the caller is too small, allocate a new >>> + * buffer; caller should call cr_put_pathname() for cleanup >>> + */ >>> +char *cr_get_fname(struct path *path, struct path *root, char *buf, >>> int *n) +{ >>> + char *fname; >>> + >>> + fname = __d_path(path, root, buf, *n); >>> + >>> + if (IS_ERR(fname) && PTR_ERR(fname) == -ENAMETOOLONG) { >>> + if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0))) >> This seems like a clunky and error-prone interface - why not just have it >> allocate the memory always? But, in this case, cr_get_fname() always seems >> to be called with ctx->tbuf, which, in turn, is an order-1 allocation. >> Here you're saying that if it's too small, you'll try replacing it with an >> order-0 allocation instead. I rather suspect that's not going to help. > > Yeah, it doesn't make much sense on the surface. I would imagine that > this has some use for when we're stacking things up in the ctx->hbuf > rather than just using it as a completely temporary buffer. But, in any > case, it doesn't make sense as it stands now, so I think it needs to be > reworked. Dave is right on the money: in Zap (the equivalent of) cr_get_fname() may be called with a buffer smaller than PATH_MAX (one page) and hence the need to allocate ad-hoc. Indeed in the current code this is not the case (yet?) so I'll go ahead and simplify it. > >>> +/* write the checkpoint header */ >>> +static int cr_write_hdr(struct cr_ctx *ctx) >>> +{ >>> + struct cr_hdr h; >>> + struct cr_hdr_head *hh = ctx->tbuf; >>> + struct timeval ktv; >>> + >>> + h.type = CR_HDR_HEAD; >>> + h.len = sizeof(*hh); >>> + h.id = 0; >>> + >>> + do_gettimeofday(&ktv); >>> + >>> + hh->magic = 0x00a2d200; >> This magic number is hard-coded in a number of places. Could it maybe >> benefit from a macro, which, in turn, could maybe end up in linux/magic.h? >> >>> +/* dump the task_struct of a given task */ >>> +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t) >> This function is going to break every time somebody changes struct >> task_struct. I'm not quite sure how to prevent that. I wonder if the >> modversions stuff could somehow be employed to detect changes and make the >> build fail? > > In general, I think any time that we are checkpointing $THING and $THING > changes, the checkpoint will break. It just so happens that all we're > checkpointing here is the task_struct, so $THING == task_struct for > now. :) > > The things that *really* worry me are things like when flags change > semantics subtly. Or, let's say a flag is used for two different things > in 2.6.26.4 vs 2.6.27. I'm not sure we're ever going to be in a > position to find and fix up stuff like that. One way to reduce the risk is to use an intermediate representation to kernel native data and properties (e.g. classify VMAs during checkpoint instead of relying blindly on the flags). The problem is not so much in restarting a checkpoint image from old kernel on a new kernel - that can be handled by conversion in user space. Tracking changes affecting the checkpoint/restart logic - well, if eventually checkpoint/restart gets to becomes main-stream enough that developers will be aware of it. > > That's one reason I have been advocating doing checkpoint/restart in > much tinier bits so that we can understand each of them as we go along. > I also think that the *less* we expose to userspace, the better. > >>> +/** >>> + * sys_checkpoint - checkpoint a container >>> + * @pid: pid of the container init(1) process >>> + * @fd: file to which dump the checkpoint image >>> + * @flags: checkpoint operation flags >>> + */ >>> +asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long >>> flags) +{ >>> + struct cr_ctx *ctx; >>> + struct file *file; >>> + int fput_needed; >>> + int ret; >>> + >>> + if (!capable(CAP_SYS_ADMIN)) >>> + return -EPERM; >> Like others, I wondered why CAP_SYS_ADMIN was required here. I *still* >> wonder, though, how you'll ever be able to do restart without a privilege >> check. There must be a thousand ways to compromise a system by messing >> with the checkpoint file. > > As with everything else coming from userspace, the checkpoint file > should be completely untrusted. I do think, though, that the ptrace > analogy that Serge?? made is a good one. The only reason I made the analogy without actually implementing it is lack of time and familiarity with the ptrace permission world. > >>> + file = fget_light(fd, &fput_needed); >>> + if (!file) >>> + return -EBADF; >> Should you maybe check for write access? An attempt to overwrite a >> read-only file won't succeed, but you could save a lot of work by just >> failing it with a clear code here. > > That's true. I'll take a look and see. > > This patch does reach down and use vfs_write() at some point. There > really aren't any other in-kernel users that do this (short of ecryptfs > and plan9fs). That makes me doubt that we're even using a good approach > here. > >> What about the file position? Perhaps there could be a good reason to >> checkpoint a process into the middle of a file, don't know. > > I think this is a good example of a place where the kernel can let > userspace shoot itself in its foot if it wants. We might also want to > allow things to be sent over fds that don't necessarily have positions, > like sockets or pipes. > >> In general, I don't see a whole lot of locking going on. Is it really >> possible to save and restore memory without ever holding mmap_sem? > > I personally haven't audited the locking, yet. It is going to be fun! There is some optimistic locking (mmap_sem), improved in the next version. Thanks, Oren. _______________________________________________ Containers mailing list Containers@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 71+ messages in thread
* Re: [Devel] [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <20080807224034.735B1F84@kernel> ` (3 preceding siblings ...) [not found] ` <20080811120315.4b3ba2c8@bike.lwn.net> @ 2008-08-18 9:26 ` Pavel Emelyanov [not found] ` <48A94061.8040206-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> 4 siblings, 1 reply; 71+ messages in thread From: Pavel Emelyanov @ 2008-08-18 9:26 UTC (permalink / raw) To: Dave Hansen, Oren Laadan Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann > diff -puN /dev/null ckpt/ckpt_hdr.h > --- /dev/null 2007-04-11 11:48:27.000000000 -0700 > +++ linux-2.6.git-dave/ckpt/ckpt_hdr.h 2008-08-07 15:37:22.000000000 -0700 > @@ -0,0 +1,69 @@ > +/* > + * Generic container checkpoint-restart > + * > + * Copyright (C) 2008 Oren Laadan > + * > + * This file is subject to the terms and conditions of the GNU General Public > + * License. See the file COPYING in the main directory of the Linux > + * distribution for more details. > + */ > + > +#include <linux/types.h> > + > +struct cr_hdr { > + __s16 type; > + __s16 len; > + __u32 id; > +}; Sorry for probably being out-of-date again, but isn't it better to put these headers in the include/linux and export them to the user space? Why? Because we'll need some image-dumping tool (let alone the image converting one for compatibility purposes) and these tools would require to know how the image looks like. Thanks, Pavel ^ permalink raw reply [flat|nested] 71+ messages in thread
[parent not found: <48A94061.8040206-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>]
* Re: [Devel] [RFC][PATCH 1/4] checkpoint-restart: general infrastructure [not found] ` <48A94061.8040206-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org> @ 2008-08-20 19:10 ` Dave Hansen 0 siblings, 0 replies; 71+ messages in thread From: Dave Hansen @ 2008-08-20 19:10 UTC (permalink / raw) To: Pavel Emelyanov Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Arnd Bergmann On Mon, 2008-08-18 at 13:26 +0400, Pavel Emelyanov wrote: > > diff -puN /dev/null ckpt/ckpt_hdr.h > > --- /dev/null 2007-04-11 11:48:27.000000000 -0700 > > +++ linux-2.6.git-dave/ckpt/ckpt_hdr.h 2008-08-07 15:37:22.000000000 -0700 > > @@ -0,0 +1,69 @@ > > +/* > > + * Generic container checkpoint-restart > > + * > > + * Copyright (C) 2008 Oren Laadan > > + * > > + * This file is subject to the terms and conditions of the GNU General Public > > + * License. See the file COPYING in the main directory of the Linux > > + * distribution for more details. > > + */ > > + > > +#include <linux/types.h> > > + > > +struct cr_hdr { > > + __s16 type; > > + __s16 len; > > + __u32 id; > > +}; > > Sorry for probably being out-of-date again, but isn't it better to > put these headers in the include/linux and export them to the user > space? > > Why? Because we'll need some image-dumping tool (let alone the > image converting one for compatibility purposes) and these tools > would require to know how the image looks like. What's the deal with headers being exported these days? Don't we always have to sanitize them before we ship them over to userspace anyway? -- Dave ^ permalink raw reply [flat|nested] 71+ messages in thread
* [RFC][PATCH 0/4] kernel-based checkpoint restart
@ 2008-08-07 22:40 Dave Hansen
0 siblings, 0 replies; 71+ messages in thread
From: Dave Hansen @ 2008-08-07 22:40 UTC (permalink / raw)
To: Oren Laadan
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
Theodore Tso, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Dave Hansen
These patches are from Oren Laaden. I've refactored them
a bit to make them a wee bit more reviewable. I think this
separates out the per-arch bits pretty well. It should also
be at least build-bisetable.
If there are no objections to this general approach, then we
plan to start submitting these bits to -mm.
--
At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach. With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.
This is different that what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data. We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing. This unites those into a single, grand interface.
The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs. It will also contain
copies of the actual memory that the process uses. Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.
This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.
These patches basically serialize internel kernel state and write
it out to a file descriptor. The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.
In this incarnation, they can only work checkpoint and restore a
single task. The task's address space may consist of only private,
simple vma's - anonymous or file-mapped.
--
Oren's original announcement
In the recent mini-summit at OLS 2008 and the following days it was
agreed to tackle the checkpoint/restart (CR) by beginning with a very
simple case: save and restore a single task, with simple memory
layout, disregarding other task state such as files, signals etc.
Following these discussions I coded a prototype that can do exactly
that, as a starter. This code adds two system calls - sys_checkpoint
and sys_restart - that a task can call to save and restore its state
respectively. It also demonstrates how the checkpoint image file can
be formatted, as well as show its nested nature (e.g. cr_write_mm()
-> cr_write_vma() nesting).
The state that is saved/restored is the following:
* some of the task_struct
* some of the thread_struct and thread_info
* the cpu state (including FPU)
* the memory address space
[The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43
of Linus's tree (uhhh.. don't ask why), but against tonight's head too].
In the current code, sys_checkpoint will checkpoint the current task,
although the logic exists to checkpoint other tasks (not in the
checkpointee's execution context). A simple loop will extend this to
handle multiple processes. sys_restart restarts the current tasks, and
with multiple tasks each task will call the syscall independently.
(Actually, to checkpoint outside the context of a task, it is also
necessary to also handle restart-block logic when saving/restoring the
thread data).
It takes longer to describe what isn't implemented or supported by
this prototype ... basically everything that isn't as simple as the
above.
As for containers - since we still don't have a representation for a
container, this patch has no notion of a container. The tests for
consistent namespaces (and isolation) are also omitted.
Below are two example programs: one uses checkpoint (called ckpt) and
one uses restart (called rstr). Execute like this (as a superuser):
orenl:~/test$ ./ckpt > out.1
hello, world! (ret=1) <-- sys_checkpoint returns positive id
<-- ctrl-c
orenl:~/test$ ./ckpt > out.2
hello, world! (ret=2)
<-- ctrl-c
orenl:~/test$ ./rstr < out.1
hello, world! (ret=0) <-- sys_restart return 0
(if you check the output of ps, you'll see that "rstr" changed its
name to "ckpt", as expected).
Hoping this will accelerate the discussion. Comments are welcome.
Let the fun begin :)
Oren.
============================== ckpt.c ================================
#define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd_32.h>
#include <sys/syscall.h>
int main(int argc, char *argv[])
{
pid_t pid = getpid();
int ret;
ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
if (ret < 0)
perror("checkpoint");
fprintf(stderr, "hello, world! (ret=%d)\n", ret);
while (1)
;
return 0;
}
============================== rstr.c ================================
#define _GNU_SOURCE /* or _BSD_SOURCE or _SVID_SOURCE */
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd_32.h>
#include <sys/syscall.h>
int main(int argc, char *argv[])
{
pid_t pid = getpid();
int ret;
ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
if (ret < 0)
perror("restart");
printf("should not reach here !\n");
return 0;
}
^ permalink raw reply [flat|nested] 71+ messages in threadend of thread, other threads:[~2008-08-28 23:40 UTC | newest]
Thread overview: 71+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20080807224033.FFB3A2C1@kernel>
2008-08-07 22:40 ` [RFC][PATCH 1/4] checkpoint-restart: general infrastructure Dave Hansen
2008-08-07 22:40 ` [RFC][PATCH 2/4] checkpoint/restart: x86 support Dave Hansen
2008-08-07 22:40 ` [RFC][PATCH 3/4] checkpoint/restart: memory management Dave Hansen
2008-08-07 22:40 ` [RFC][PATCH 4/4] introduce sys_checkpoint and sys_restore Dave Hansen
2008-08-08 9:25 ` [RFC][PATCH 0/4] kernel-based checkpoint restart Arnd Bergmann
[not found] ` <1218218784.19082.10.camel@nimitz>
2008-08-08 18:18 ` Arnd Bergmann
[not found] ` <200808081125.12706.arnd-r2nGTMty4D4@public.gmane.org>
2008-08-08 18:06 ` Dave Hansen
2008-08-08 19:44 ` Oren Laadan
[not found] ` <20080807224035.E56663BF@kernel>
2008-08-08 12:09 ` [RFC][PATCH 2/4] checkpoint/restart: x86 support Arnd Bergmann
[not found] ` <200808081409.30591.arnd@arndb.de>
[not found] ` <200808081409.30591.arnd-r2nGTMty4D4@public.gmane.org>
2008-08-08 20:28 ` Oren Laadan
[not found] ` <489CAC70.7090809-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-08-08 22:29 ` Arnd Bergmann
[not found] ` <200808090029.28286.arnd@arndb.de>
[not found] ` <200808090029.28286.arnd-r2nGTMty4D4@public.gmane.org>
2008-08-08 23:04 ` Oren Laadan
[not found] ` <489CD0F9.9060603@cs.columbia.edu>
[not found] ` <489CD0F9.9060603-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-08-09 0:38 ` Dave Hansen
2008-08-09 6:43 ` Arnd Bergmann
[not found] ` <1218242286.19082.62.camel@nimitz>
2008-08-09 1:20 ` Oren Laadan
[not found] ` <489CF0CE.1000603@cs.columbia.edu>
[not found] ` <489CF0CE.1000603-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-08-09 2:20 ` Dave Hansen
[not found] ` <1218248458.19082.68.camel@nimitz>
2008-08-09 2:35 ` Oren Laadan
2008-08-10 14:55 ` Jeremy Fitzhardinge
[not found] ` <489F015E.9080704@goop.org>
[not found] ` <489F015E.9080704-TSDbQ3PG+2Y@public.gmane.org>
2008-08-11 15:36 ` Dave Hansen
[not found] ` <1218468964.5598.3.camel@nimitz>
2008-08-11 16:07 ` Jeremy Fitzhardinge
[not found] ` <20080807224037.44DA0DB8@kernel>
2008-08-08 12:12 ` [RFC][PATCH 3/4] checkpoint/restart: memory management Arnd Bergmann
[not found] ` <20080807224038.0B03CEEF@kernel>
2008-08-08 12:15 ` [RFC][PATCH 4/4] introduce sys_checkpoint and sys_restore Arnd Bergmann
[not found] ` <200808081415.19179.arnd@arndb.de>
[not found] ` <200808081415.19179.arnd-r2nGTMty4D4@public.gmane.org>
2008-08-08 20:33 ` Oren Laadan
[not found] ` <20080807224034.735B1F84@kernel>
2008-08-08 9:46 ` [RFC][PATCH 1/4] checkpoint-restart: general infrastructure Arnd Bergmann
[not found] ` <200808081146.54834.arnd@arndb.de>
[not found] ` <200808081146.54834.arnd-r2nGTMty4D4@public.gmane.org>
2008-08-08 18:50 ` Dave Hansen
[not found] ` <1218221451.19082.36.camel@nimitz>
2008-08-08 20:59 ` Oren Laadan
2008-08-08 22:13 ` Arnd Bergmann
[not found] ` <200808090013.41999.arnd@arndb.de>
[not found] ` <200808090013.41999.arnd-r2nGTMty4D4@public.gmane.org>
2008-08-08 22:26 ` Dave Hansen
2008-08-11 15:22 ` Serge E. Hallyn
[not found] ` <1218234411.19082.58.camel@nimitz>
2008-08-08 22:39 ` Arnd Bergmann
[not found] ` <200808090039.20289.arnd@arndb.de>
[not found] ` <200808090039.20289.arnd-r2nGTMty4D4@public.gmane.org>
2008-08-09 0:43 ` Dave Hansen
[not found] ` <1218242614.19082.65.camel@nimitz>
2008-08-09 6:37 ` Arnd Bergmann
[not found] ` <200808090837.07417.arnd@arndb.de>
[not found] ` <200808090837.07417.arnd-r2nGTMty4D4@public.gmane.org>
2008-08-09 13:39 ` Dave Hansen
2008-08-11 15:07 ` Serge E. Hallyn
[not found] ` <20080811150703.GA25930@us.ibm.com>
[not found] ` <20080811150703.GA25930-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-11 15:25 ` Arnd Bergmann
2008-08-14 5:53 ` Pavel Machek
[not found] ` <20080814055301.GH6995@ucw.cz>
[not found] ` <20080814055301.GH6995-+ZI9xUNit7I@public.gmane.org>
2008-08-14 15:12 ` Dave Hansen
2008-08-20 21:40 ` Oren Laadan
[not found] ` <20080811152201.GB25930@us.ibm.com>
[not found] ` <20080811152201.GB25930-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-11 16:53 ` Arnd Bergmann
[not found] ` <200808111853.13854.arnd@arndb.de>
[not found] ` <200808111853.13854.arnd-r2nGTMty4D4@public.gmane.org>
2008-08-11 17:11 ` Dave Hansen
2008-08-11 19:48 ` checkpoint/restart ABI Dave Hansen
[not found] ` <1218484114.5598.43.camel@nimitz>
2008-08-11 21:47 ` Arnd Bergmann
2008-08-11 21:54 ` Oren Laadan
2008-08-11 23:38 ` Jeremy Fitzhardinge
[not found] ` <48A0CD86.6030704@goop.org>
[not found] ` <48A0CD86.6030704-TSDbQ3PG+2Y@public.gmane.org>
2008-08-11 23:54 ` Peter Chubb
2008-08-12 14:58 ` Dave Hansen
[not found] ` <87d4kfds5i.wl%peterc@chubb.wattle.id.au>
[not found] ` <87d4kfds5i.wl%peterc-LkDQP0DxSMGxwJ88Py/mJxCuuivNXqWP@public.gmane.org>
2008-08-12 14:49 ` Serge E. Hallyn
2008-08-12 15:11 ` Dave Hansen
[not found] ` <20080812144905.GA16016@us.ibm.com>
[not found] ` <20080812144905.GA16016-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-08-28 23:40 ` Eric W. Biederman
[not found] ` <1218553091.5598.76.camel@nimitz>
2008-08-12 16:32 ` Jeremy Fitzhardinge
[not found] ` <48A1BB39.3090108@goop.org>
[not found] ` <48A1BB39.3090108-TSDbQ3PG+2Y@public.gmane.org>
2008-08-12 16:46 ` Dave Hansen
[not found] ` <1218559619.5598.97.camel@nimitz>
2008-08-12 17:04 ` Jeremy Fitzhardinge
[not found] ` <48A1C2B9.9070107@goop.org>
[not found] ` <48A1C2B9.9070107-TSDbQ3PG+2Y@public.gmane.org>
2008-08-20 21:52 ` Oren Laadan
2008-08-20 21:54 ` Oren Laadan
[not found] ` <48AC929C.9030901@cs.columbia.edu>
[not found] ` <48AC929C.9030901-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-08-20 22:11 ` Dave Hansen
[not found] ` <200808112347.50245.arnd@arndb.de>
[not found] ` <20080811171433.2ce81f28@bike.lwn.net>
[not found] ` <20080811171433.2ce81f28-vw3g6Xz/EtPk1uMJSBkQmQ@public.gmane.org>
2008-08-11 23:23 ` Dave Hansen
[not found] ` <200808112347.50245.arnd-r2nGTMty4D4@public.gmane.org>
2008-08-11 23:14 ` Jonathan Corbet
2008-08-21 5:56 ` Oren Laadan
[not found] ` <48AD0379.9030705-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-08-21 8:43 ` Arnd Bergmann
[not found] ` <200808211043.41387.arnd@arndb.de>
[not found] ` <200808211043.41387.arnd-r2nGTMty4D4@public.gmane.org>
2008-08-21 15:43 ` Oren Laadan
[not found] ` <489CB3CA.6050304@cs.columbia.edu>
[not found] ` <1218233855.19082.52.camel@nimitz>
2008-08-08 23:27 ` [RFC][PATCH 1/4] checkpoint-restart: general infrastructure Oren Laadan
[not found] ` <489CB3CA.6050304-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-08-08 22:17 ` Dave Hansen
2008-08-08 22:23 ` Arnd Bergmann
2008-08-14 8:09 ` [Devel] " Pavel Emelyanov
[not found] ` <48A3E81C.6010008@openvz.org>
[not found] ` <48A3E81C.6010008-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-08-14 15:16 ` Dave Hansen
2008-08-11 18:03 ` Jonathan Corbet
[not found] ` <20080811120315.4b3ba2c8@bike.lwn.net>
[not found] ` <20080811120315.4b3ba2c8-vw3g6Xz/EtPk1uMJSBkQmQ@public.gmane.org>
2008-08-11 18:38 ` Dave Hansen
[not found] ` <1218479921.5598.35.camel@nimitz>
2008-08-12 3:44 ` Oren Laadan
2008-08-18 9:26 ` [Devel] " Pavel Emelyanov
[not found] ` <48A94061.8040206-GEFAQzZX7r8dnm+yROfE0A@public.gmane.org>
2008-08-20 19:10 ` Dave Hansen
2008-08-07 22:40 [RFC][PATCH 0/4] kernel-based checkpoint restart Dave Hansen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox