* [BIG RFC] Filesystem-based checkpoint
@ 2008-10-28 18:37 Dave Hansen
  2008-10-28 20:56 ` Serge E. Hallyn
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Dave Hansen @ 2008-10-28 18:37 UTC (permalink / raw)
  To: containers

[-- Attachment #1: Type: text/plain, Size: 3258 bytes --]

I hate the syscall.  It's a very un-Linux-y way of doing things.  There,
I said it.  Here's an alternative.  It still uses the syscall to
initiate things, but it uses debugfs to transport the data instead.
This is just a concept demonstration.  It doesn't actually work, and I
wouldn't be using debugfs in practice.

System calls in Linux are fast.  Doing lots of them is not a problem.
If it becomes one, we can always export a condensed version of this
format next to the expanded one, kinda like ftrace does.  Atomicity with
this approach is also not a problem.  The system call in this approach
doesn't return until the checkpoint is completely written out.

This lets userspace pick and choose what parts of the checkpoint it
cares about.  It enables us to do all the I/O from userspace: no
in-kernel sys_read/write().  I think this interface is much more
flexible than a plain syscall.

Want to do a fast checkpoint?  Fine, copy all data, use a lot of memory,
store it in-kernel.  Dump that out when the filesystem is accessed.
Destroy it when userspace asks.

Want to do a checkpoint with a small memory footprint?
10 write one struct
20 wait for userspace
30 goto 10

Userspace can loop like it is reading a pipe.  We could even track
per-checkpoint memory usage in the cr_ctx and stop writing when we go
over a certain memory threshold.
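
For illustration only (not part of the posted patch), here is a minimal
userspace sketch of the "loop like it is reading a pipe" idea.  The helper
name drain_records() and its consume callback are hypothetical; the only
assumption is that the kernel side hands data over on a readable fd and
signals the end of the checkpoint with EOF.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: read whatever the producer hands over on fd, one
 * buffer at a time, until EOF.  Returns the total number of bytes
 * consumed, or -1 on a read error.
 */
long drain_records(int fd, void (*consume)(const char *buf, size_t len))
{
	char buf[4096];
	long total = 0;

	assert(fd >= 0);
	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, just retry */
			return -1;
		}
		if (n == 0)
			break;		/* producer closed its end */
		if (consume)
			consume(buf, (size_t)n);
		total += n;
	}
	return total;
}
```

Because the reader blocks in read() until more data shows up, the producer
never needs to hold more than one buffer's worth of records at a time.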

We can have two modes, internally.  Userspace never has to know
which one we've chosen.  Say we have a word of data to output.  We can
either make a copy at sys_checkpoint() time and let the data continue to
be modified (let the task run).  Or, we can keep the task frozen and
generate data at debugfs read() time.  This means potentially zero
copying of data until userspace wants it.
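
A rough userspace model of the two modes, as a sketch only: struct
ckpt_word and its helpers are made-up names, not anything in the patch.
The point is that the read side looks identical in both modes, while only
the copy-at-checkpoint mode tolerates the live value changing afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One exported word of checkpoint data (hypothetical model). */
struct ckpt_word {
	const uint32_t *live;	/* the live, possibly still-changing value */
	uint32_t snapshot;	/* copy taken at checkpoint time */
	int use_snapshot;	/* 1: copy at checkpoint, 0: generate at read */
};

/* Mode 1: copy now, so the task may keep running and modify *live. */
void ckpt_word_init_copy(struct ckpt_word *w, const uint32_t *live)
{
	assert(w && live);
	w->live = live;
	w->snapshot = *live;
	w->use_snapshot = 1;
}

/* Mode 2: copy nothing; the task has to stay frozen until the read. */
void ckpt_word_init_lazy(struct ckpt_word *w, const uint32_t *live)
{
	assert(w && live);
	w->live = live;
	w->snapshot = 0;
	w->use_snapshot = 0;
}

/* What a read() of the exported file would return, in either mode. */
uint32_t ckpt_word_read(const struct ckpt_word *w)
{
	return w->use_snapshot ? w->snapshot : *w->live;
}
```

If the value changes after the checkpoint, the copy mode still returns the
old value, while the lazy mode returns whatever is there at read time,
which is why it is only safe while the task stays frozen.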

The same goes for structures which might have complicated locking or
lifetime rules.  

This also shows how we might handle shared objects.

To use, just sys_checkpoint() as before, and look at /sys/kernel/debug/.
Use the crid you got back from the syscall to locate your checkpoint.
Write into the 'done' file when you want the sys_checkpoint() to return.

/sys/kernel/debug/checkpoint-1/
/sys/kernel/debug/checkpoint-1/done
/sys/kernel/debug/checkpoint-1/task-1141
/sys/kernel/debug/checkpoint-1/task-1141/fds
/sys/kernel/debug/checkpoint-1/task-1141/fds/1
/sys/kernel/debug/checkpoint-1/task-1141/fds/1/coe
/sys/kernel/debug/checkpoint-1/task-1141/fds/1/fd_nr
/sys/kernel/debug/checkpoint-1/task-1141/fds/1/fd
/sys/kernel/debug/checkpoint-1/task-1141/fds/0
/sys/kernel/debug/checkpoint-1/task-1141/fds/0/coe
/sys/kernel/debug/checkpoint-1/task-1141/fds/0/fd_nr
/sys/kernel/debug/checkpoint-1/task-1141/fds/0/fd
/sys/kernel/debug/checkpoint-1/files
/sys/kernel/debug/checkpoint-1/files/2
/sys/kernel/debug/checkpoint-1/files/2/f_version
/sys/kernel/debug/checkpoint-1/files/2/f_pos
/sys/kernel/debug/checkpoint-1/files/2/f_mode
/sys/kernel/debug/checkpoint-1/files/2/f_flags
/sys/kernel/debug/checkpoint-1/files/1
/sys/kernel/debug/checkpoint-1/files/1/target
/sys/kernel/debug/checkpoint-1/files/1/fd_type
/sys/kernel/debug/checkpoint-1/files/1/f_version
/sys/kernel/debug/checkpoint-1/files/1/f_pos
/sys/kernel/debug/checkpoint-1/files/1/f_mode
/sys/kernel/debug/checkpoint-1/files/1/f_flags
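
As a sketch of the userspace side of this (helper names are hypothetical,
and the base directory is a parameter so nothing here depends on where
debugfs is mounted): build the path from the crid, then write anything
into 'done'.  In the patch below, the write handler ignores the data and
only uses the write itself as the signal.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Build "<base>/checkpoint-<crid>/<leaf>" into out.
 * Returns 0 on success, -1 if the result would not fit.
 */
int ckpt_path(char *out, size_t outlen, const char *base, int crid,
	      const char *leaf)
{
	int n;

	assert(out && base && leaf);
	n = snprintf(out, outlen, "%s/checkpoint-%d/%s", base, crid, leaf);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/*
 * Tell the kernel we are finished with checkpoint <crid> by writing one
 * byte into its 'done' file.  Returns 0 on success, -1 on any error.
 */
int ckpt_signal_done(const char *base, int crid)
{
	char path[4096];
	ssize_t n;
	int fd;

	if (ckpt_path(path, sizeof(path), base, crid, "done") < 0)
		return -1;
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	n = write(fd, "1", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}
```

A caller would read whatever files it cares about under the checkpoint
directory first, and call ckpt_signal_done("/sys/kernel/debug", crid) last.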

So, why not?

-- Dave

[-- Attachment #2: debugfs-fun0.patch --]
[-- Type: text/x-patch, Size: 9039 bytes --]


index 9c2d949..f4eb855 100644
DESC
debugfs-fun1
EDESC

---

 linux-2.6.git-dave/arch/x86/mm/checkpoint.c       |   28 +++++++++++++
 linux-2.6.git-dave/checkpoint/checkpoint.c        |   21 ++++++++++
 linux-2.6.git-dave/checkpoint/ckpt_file.c         |   13 ------
 linux-2.6.git-dave/checkpoint/sys.c               |   45 +++++++++++++++++++++-
 linux-2.6.git-dave/include/linux/checkpoint.h     |    8 +++
 linux-2.6.git-dave/include/linux/checkpoint_hdr.h |   14 ++++--
 6 files changed, 110 insertions(+), 19 deletions(-)

diff -puN arch/x86/mm/checkpoint.c~debugfs-fun0 arch/x86/mm/checkpoint.c
--- linux-2.6.git/arch/x86/mm/checkpoint.c~debugfs-fun0	2008-10-23 10:27:13.000000000 -0700
+++ linux-2.6.git-dave/arch/x86/mm/checkpoint.c	2008-10-23 10:27:13.000000000 -0700
@@ -11,9 +11,32 @@
 #include <asm/desc.h>
 #include <asm/i387.h>
 
+#include <linux/debugfs.h>
+
 #include <linux/checkpoint.h>
 #include <linux/checkpoint_hdr.h>
 
+struct dentry *blobhelp(const char *name, mode_t mode,
+			struct dentry *parent, void *blob, int size)
+{
+	struct debugfs_blob_wrapper *wrap = kmalloc(sizeof(*wrap), GFP_KERNEL);
+	wrap->data = kmalloc(size, GFP_KERNEL);
+	memcpy(wrap->data, blob, size);
+	wrap->size = size;
+	return debugfs_create_blob(name, mode, parent, wrap);
+}
+
+char *tdir(u32 pid)
+{
+	char *buf;
+	// 7 for 'thread-'
+	// 10 for 32-bit int
+	// 1 for \0
+	buf = kmalloc(18, GFP_KERNEL);
+	sprintf(buf, "thread-%u", pid);
+	return buf;
+}
+
 /* dump the thread_struct of a given task */
 int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
 {
@@ -23,10 +46,12 @@ int cr_write_thread(struct cr_ctx *ctx, 
 	struct desc_struct *desc;
 	int ntls = 0;
 	int n, ret;
+	struct dentry *dir;
 
 	h.type = CR_HDR_THREAD;
 	h.len = sizeof(*hh);
 	h.parent = task_pid_vnr(t);
+	dir = debugfs_create_dir(tdir(h.parent), ctx->debugfs_dir);
 
 	thread = &t->thread;
 
@@ -40,6 +65,8 @@ int cr_write_thread(struct cr_ctx *ctx, 
 	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
 	hh->sizeof_tls_array = sizeof(thread->tls_array);
 	hh->ntls = ntls;
+	debugfs_create_u16("ntls", 0444, dir, &hh->ntls);
+	debugfs_create_u16("gdt_entry_tls_entries", 0444, dir, &hh->gdt_entry_tls_entries);
 
 	ret = cr_write_obj(ctx, &h, hh);
 	cr_hbuf_put(ctx, sizeof(*hh));
@@ -48,6 +75,7 @@ int cr_write_thread(struct cr_ctx *ctx, 
 
 	/* for simplicity dump the entire array, cherry-pick upon restart */
 	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+	blobhelp("tls_array", 0444, dir, thread->tls_array, sizeof(thread->tls_array));
 
 	cr_debug("ntls %d\n", ntls);
 
diff -puN checkpoint/sys.c~debugfs-fun0 checkpoint/sys.c
--- linux-2.6.git/checkpoint/sys.c~debugfs-fun0	2008-10-23 10:27:13.000000000 -0700
+++ linux-2.6.git-dave/checkpoint/sys.c	2008-10-28 11:18:04.000000000 -0700
@@ -8,6 +8,7 @@
  *  distribution for more details.
  */
 
+#include <linux/debugfs.h>
 #include <linux/sched.h>
 #include <linux/nsproxy.h>
 #include <linux/ptrace.h>
@@ -147,7 +148,7 @@ void *cr_hbuf_get(struct cr_ctx *ctx, in
 void cr_hbuf_put(struct cr_ctx *ctx, int n)
 {
 	BUG_ON(ctx->hpos < n);
-	ctx->hpos -= n;
+	//ctx->hpos -= n;
 }
 
 /*
@@ -217,11 +218,12 @@ static void cr_ctx_free(struct cr_ctx *c
 	if (ctx->file)
 		fput(ctx->file);
 
-	kfree(ctx->hbuf);
+	//kfree(ctx->hbuf);
 
 	if (ctx->vfsroot)
 		path_put(ctx->vfsroot);
 
+	return;
 	cr_pgarr_free(ctx);
 	cr_objhash_free(ctx);
 
@@ -269,6 +271,12 @@ static struct cr_ctx *cr_ctx_alloc(pid_t
 
 	ctx->crid = atomic_inc_return(&cr_ctx_count);
 
+	{
+		char buf[32];
+		sprintf(&buf[0], "checkpoint-%d", ctx->crid);
+		ctx->debugfs_dir = debugfs_create_dir(&buf[0], NULL);
+		ctx->fd_dir = debugfs_create_dir("files", ctx->debugfs_dir);
+	}
 	return ctx;
 
  err:
@@ -276,6 +284,30 @@ static struct cr_ctx *cr_ctx_alloc(pid_t
 	return ERR_PTR(err);
 }
 
+/*
+ * Copied from debugfs, needs cleanup
+ */
+static int default_open(struct inode *inode, struct file *file)
+{
+	if (inode->i_private)
+		file->private_data = inode->i_private;
+
+	return 0;
+}
+
+static ssize_t cr_debugfs_done(struct file *file, const char __user *user_buf,
+		                               size_t count, loff_t *ppos)
+{
+	struct cr_ctx *ctx = file->private_data;
+	mutex_unlock(&ctx->mutex_done);
+	return count;
+}
+
+static const struct file_operations debugfs_done_fops = {
+	.write = cr_debugfs_done,
+	.open =	 default_open,
+};
+
 /**
  * sys_checkpoint - checkpoint a container
  * @pid: pid of the container init(1) process
@@ -303,6 +335,14 @@ asmlinkage long sys_checkpoint(pid_t pid
 	if (!ret)
 		ret = ctx->crid;
 
+	/*
+	 * Wait for userspace to consume the image
+	 */
+	mutex_init(&ctx->mutex_done);
+	debugfs_create_file("done", 0200, ctx->debugfs_dir,
+				ctx, &debugfs_done_fops);
+	mutex_lock(&ctx->mutex_done);
+	mutex_lock(&ctx->mutex_done);
 	cr_ctx_free(ctx);
 	return ret;
 }
@@ -334,3 +374,4 @@ asmlinkage long sys_restart(int crid, in
 	cr_ctx_free(ctx);
 	return ret;
 }
+
diff -puN include/linux/checkpoint.h~debugfs-fun0 include/linux/checkpoint.h
--- linux-2.6.git/include/linux/checkpoint.h~debugfs-fun0	2008-10-23 10:27:13.000000000 -0700
+++ linux-2.6.git-dave/include/linux/checkpoint.h	2008-10-28 10:57:39.000000000 -0700
@@ -36,6 +36,11 @@ struct cr_ctx {
 	struct list_head pgarr_list;	/* page array to dump VMA contents */
 
 	struct path *vfsroot;	/* container root (FIXME) */
+
+	struct mutex mutex_done;
+	struct dentry *debugfs_dir;
+	struct dentry *fd_dir;
+	struct dentry *current_task_dir;
 };
 
 /* cr_ctx: flags */
@@ -73,7 +78,8 @@ struct cr_hdr;
 extern int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf);
 extern int cr_write_string(struct cr_ctx *ctx, char *str, int len);
 extern int cr_write_fname(struct cr_ctx *ctx,
-			  struct path *path, struct path *root);
+			  struct path *path, struct path *root,
+			  struct dentry *debugfs_dir);
 
 extern int cr_read_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf, int n);
 extern int cr_read_obj_type(struct cr_ctx *ctx, void *buf, int n, int type);
diff -puN include/linux/checkpoint_hdr.h~debugfs-fun0 include/linux/checkpoint_hdr.h
--- linux-2.6.git/include/linux/checkpoint_hdr.h~debugfs-fun0	2008-10-23 10:27:13.000000000 -0700
+++ linux-2.6.git-dave/include/linux/checkpoint_hdr.h	2008-10-23 10:27:13.000000000 -0700
@@ -22,10 +22,16 @@
 
 /* records: generic header */
 
+typedef int (cr_hdr_op)(struct cr_ctx *ctx, struct cr_hdr *cr_hdr, void *private);
+
 struct cr_hdr {
 	__s16 type;
 	__s16 len;
 	__u32 parent;
+
+	void *data;
+	cr_hdr_op *cr_op;
+	void *buf; /* of length len ^^ */
 };
 
 /* header types */
@@ -34,20 +40,20 @@ enum {
 	CR_HDR_STRING,
 	CR_HDR_FNAME,
 
-	CR_HDR_TASK = 101,
+	CR_HDR_TASK,
 	CR_HDR_THREAD,
 	CR_HDR_CPU,
 
-	CR_HDR_MM = 201,
+	CR_HDR_MM,
 	CR_HDR_VMA,
 	CR_HDR_PGARR,
 	CR_HDR_MM_CONTEXT,
 
-	CR_HDR_FILES = 301,
+	CR_HDR_FILES,
 	CR_HDR_FD_ENT,
 	CR_HDR_FD_DATA,
 
-	CR_HDR_TAIL = 5001
+	CR_HDR_TAIL
 };
 
 struct cr_hdr_head {
diff -puN security/Makefile~debugfs-fun0 security/Makefile
diff -puN checkpoint/checkpoint.c~debugfs-fun0 checkpoint/checkpoint.c
--- linux-2.6.git/checkpoint/checkpoint.c~debugfs-fun0	2008-10-28 11:18:04.000000000 -0700
+++ linux-2.6.git-dave/checkpoint/checkpoint.c	2008-10-28 11:18:04.000000000 -0700
@@ -191,6 +191,26 @@ static int cr_write_task_struct(struct c
 	return cr_write_string(ctx, t->comm, TASK_COMM_LEN);
 }
 
+int cr_create_task_dir(struct cr_ctx *ctx, struct task_struct *t)
+{
+	char buf[22];
+	// 5 for 'task-'
+	// 11 for a signed 32-bit int, sign included
+	// 1 for \0
+	sprintf(buf, "task-%d", task_pid_vnr(t));
+
+	/*
+	 * This is not very nice to hide in here, so
+	 * eventually just pass this around or make
+	 * a cr-specific on-stack structure just for
+	 * tasks.
+	 */
+	ctx->current_task_dir =
+		debugfs_create_dir(&buf[0], ctx->debugfs_dir);
+
+	return 0;
+}
+
 /* dump the entire state of a given task */
 static int cr_write_task(struct cr_ctx *ctx, struct task_struct *t)
 {
@@ -203,6 +223,7 @@ static int cr_write_task(struct cr_ctx *
 		return -EAGAIN;
 	}
 
+	cr_create_task_dir(ctx, t);
 	ret = cr_write_task_struct(ctx, t);
 	cr_debug("task_struct: ret %d\n", ret);
 	if (ret < 0)
diff -puN checkpoint/ckpt_file.c~debugfs-fun0 checkpoint/ckpt_file.c
--- linux-2.6.git/checkpoint/ckpt_file.c~debugfs-fun0	2008-10-28 11:18:04.000000000 -0700
+++ linux-2.6.git-dave/checkpoint/ckpt_file.c	2008-10-28 11:18:04.000000000 -0700
@@ -216,17 +216,6 @@ out:
 	return ret;
 }
 
-static char *tfddir(u32 pid)
-{
-	char *buf;
-	// 11 for 'thread--fds'
-	// 10 for 32-bit int
-	// 1 for \0
-	buf = kmalloc(22, GFP_KERNEL);
-	sprintf(buf, "thread-%d-fds", pid);
-	return buf;
-}
-
 int cr_write_files(struct cr_ctx *ctx, struct task_struct *t)
 {
 	struct cr_hdr h;
@@ -239,7 +228,7 @@ int cr_write_files(struct cr_ctx *ctx, s
 	h.type = CR_HDR_FILES;
 	h.len = sizeof(*hh);
 	h.parent = task_pid_vnr(t);
-	dir = debugfs_create_dir(tfddir(h.parent), ctx->debugfs_dir);
+	dir = debugfs_create_dir("fds", ctx->current_task_dir);
 
 	files = get_files_struct(t);
 
_

[-- Attachment #3: Type: text/plain, Size: 206 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BIG RFC] Filesystem-based checkpoint
  2008-10-28 18:37 [BIG RFC] Filesystem-based checkpoint Dave Hansen
@ 2008-10-28 20:56 ` Serge E. Hallyn
       [not found]   ` <20081028205654.GA17487-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-10-30 18:19 ` Oren Laadan
  2008-10-30 23:33 ` Eric W. Biederman
  2 siblings, 1 reply; 26+ messages in thread
From: Serge E. Hallyn @ 2008-10-28 20:56 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers

Quoting Dave Hansen (dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> I hate the syscall.  It's a very un-Linux-y way of doing things.  There,

Not really the syscall, but the writing to the file from the kernel.
Any time I see set_fs(KERNEL_DS) I get flashbacks to getting yelled at
in the 90s :)

> I said it.  Here's an alternative.  It still uses the syscall to
> initiate things, but it uses debugfs to transport the data instead.
> This is just a concept demonstration.  It doesn't actually work, and I
> wouldn't be using debugfs in practice.

It's neat how few lines this took, but I would prefer using a tiny
custom fs rather than use debugfs for dump and configfs for restore.

If you like I can take a shot at whipping up the new mini-fs, though
I think you're having fun :)

-serge

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]   ` <20081028205654.GA17487-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-10-28 21:00     ` Dave Hansen
  2008-10-28 21:10     ` Dave Hansen
  1 sibling, 0 replies; 26+ messages in thread
From: Dave Hansen @ 2008-10-28 21:00 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: containers

On Tue, 2008-10-28 at 15:56 -0500, Serge E. Hallyn wrote:
> Quoting Dave Hansen (dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> > I hate the syscall.  It's a very un-Linux-y way of doing things.  There,
> 
> Not really the syscall, but the writing to the file from the kernel.
> Any time I see set_fs(KERNEL_DS) I get flashbacks to getting yelled at
> in the 90s :)

Heh.  You security whackos are always getting yelled at for _something_
anyway.

> > I said it.  Here's an alternative.  It still uses the syscall to
> > initiate things, but it uses debugfs to transport the data instead.
> > This is just a concept demonstration.  It doesn't actually work, and I
> > wouldn't be using debugfs in practice.
> 
> It's neat how few lines this took, but I would prefer using a tiny
> custom fs rather than use debugfs for dump and configfs for restore.

Yeah, doing a new FS would certainly be a ton more code.  But I think
the most important part is how complicated it ends up being in
practice.

It may turn out that refactoring some existing debug/configfs code might
be enough to get us there without too much new code *just* for us.

> If you like I can take a shot at whipping up the new mini-fs, though
> I think you're having fun :)

I need to look into what configfs can give me, next.  I'll keep
playing. :)

I really just wanted to know what Oren and Andrey thought.

-- Dave

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]   ` <20081028205654.GA17487-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-10-28 21:00     ` Dave Hansen
@ 2008-10-28 21:10     ` Dave Hansen
  2008-10-30 16:25       ` Oren Laadan
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2008-10-28 21:10 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: containers

On Tue, 2008-10-28 at 15:56 -0500, Serge E. Hallyn wrote:
> If you like I can take a shot at whipping up the new mini-fs, though
> I think you're having fun :)

There are a couple of concepts that just get easier once you start
thinking of this as an entire fs too.  For instance, cr_ctx just becomes
crfs_sb.  For things like dumping in parallel, we get locking and
lifetime rules for free from the vfs.

-- Dave

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BIG RFC] Filesystem-based checkpoint
  2008-10-28 21:10     ` Dave Hansen
@ 2008-10-30 16:25       ` Oren Laadan
       [not found]         ` <4909E000.9070201-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Oren Laadan @ 2008-10-30 16:25 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers



Dave Hansen wrote:
> On Tue, 2008-10-28 at 15:56 -0500, Serge E. Hallyn wrote:
>> If you like I can take a shot at whipping up the new mini-fs, though
>> I think you're having fun :)
> 
> There are a couple of concepts that just get easier once you start
> thinking of this as an entire fs too.  For instance, cr_ctx just becomes
> crfs_sb.  For things like dumping in parallel, we get locking and
> lifetime rules for free from the vfs.

Well, 'cr_ctx' is per-checkpoint, while crfs_sb will be a single one for
the entire system. So you'll need to add something per checkpoint anyway.

What other concepts get easier ?

Oren.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]         ` <4909E000.9070201-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2008-10-30 16:36           ` Dave Hansen
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Hansen @ 2008-10-30 16:36 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers

On Thu, 2008-10-30 at 12:25 -0400, Oren Laadan wrote:
> Dave Hansen wrote:
> > On Tue, 2008-10-28 at 15:56 -0500, Serge E. Hallyn wrote:
> >> If you like I can take a shot at whipping up the new mini-fs, though
> >> I think you're having fun :)
> > 
> > There are a couple of concepts that just get easier once you start
> > thinking of this as an entire fs too.  For instance, cr_ctx just becomes
> > crfs_sb.  For things like dumping in parallel, we get locking and
> > lifetime rules for free from the vfs.
> 
> Well, 'cr_ctx' is per-checkpoint, while crfs_sb will be a single one for
> the entire system. So you'll need to add something per checkpoint anyway.

I was thinking of it more along the lines of requiring a new filesystem
mount for each checkpoint.  That way, we dispose of the checkpoint by
the act of unmounting.

> What other concepts get easier ?

The amount of infrastructure needed to do lookups for shared objects
goes to zero.  We don't need a hash table or ids with which we index
into that table.  Filesystems are good at giving things names and
finding them later.
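
To make the naming idea concrete, here is a small sketch.  It assumes,
purely for illustration, that the per-task 'fds/<n>/fd' file holds the
decimal name of an entry under 'files/'; the thread does not spell out what
that file contains, and ckpt_shared_file_path() is a made-up helper.  Two
fds that share a file then resolve to the same path, with no hash table
involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Given the checkpoint root and the text read from an 'fds/<n>/fd' file
 * (assumed to be a decimal id), build "<root>/files/<id>" into out.
 * Returns 0 on success, -1 on a malformed id or if out is too small.
 */
int ckpt_shared_file_path(char *out, size_t outlen, const char *root,
			  const char *fd_text)
{
	char *end;
	long id;
	int n;

	assert(out && root && fd_text);
	errno = 0;
	id = strtol(fd_text, &end, 10);
	if (errno || end == fd_text || id < 0)
		return -1;
	n = snprintf(out, outlen, "%s/files/%ld", root, id);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```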

-- Dave

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BIG RFC] Filesystem-based checkpoint
  2008-10-28 18:37 [BIG RFC] Filesystem-based checkpoint Dave Hansen
  2008-10-28 20:56 ` Serge E. Hallyn
@ 2008-10-30 18:19 ` Oren Laadan
       [not found]   ` <4909FAA8.5000107-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-10-30 23:33 ` Eric W. Biederman
  2 siblings, 1 reply; 26+ messages in thread
From: Oren Laadan @ 2008-10-30 18:19 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers


I'm not sure why you say it's "un-linux-y" to begin with. But to the
point, here are my thoughts:


1. What you suggest is to expose the internal data to user space and
pull it. Isn't that what cryo tried to do ?  And the conclusion was
that it takes too many interfaces to work out, code in, provide, and
maintain forever, with issues related to backward compatibility and
what not. In fact, the conclusion was "let's do a kernel-blob" !


2. So there is a high price tag for the extra flexibility - more code,
more complexity, more maintenance nightmare, more API fights. But the
real question IMHO is what do you gain from it ?

> This lets userspace pick and choose what parts of the checkpoint it
> cares about.

So what ?  Why do you ever need that ?  What sort of information would
you get from there, that you can't get from existing mechanisms (ptrace) ?

If this is only to be able to parallelize checkpoint - then let's discuss
the problem, not a specific solution.

> It enables us to do all the I/O from userspace: no in-kernel
> sys_read/write().

What's so wrong with in-kernel vfs_read/write() ?  You mentioned deadlocks,
but I'm yet to see one and understand the problem. My experience with Zap
(and Andrey's with OpenVZ) has been pretty good.

If eventually this becomes the main issue, we can discuss alternatives
(some have been proposed in the past) and again, fit a solution to the
problem as opposed to fit a problem to a solution.

> I think this interface is much more flexible than a plain syscall.

Flexibility can be a friend or an enemy. Can you quantify or qualify what
you gain, for the high cost of going in that direction ?


3. Your approach doesn't play well with what I call "checkpoint that
involves self". This term refers to a process that checkpoints itself
(and only itself), or to a process that attempts to checkpoint its own
container.  In both cases, there is no other entity that will read the
data from the file system while the caller is blocked.


4. I'm not sure how you want to handle shared objects. Simply saying:

> This also shows how we might handle shared objects.

isn't quite convincing. Keep in mind that sharing is determined in kernel,
and in the order that objects are encountered (as they should only be
dumped once). There may be objects that are shared, and themselves refer
to objects that are shared, and such objects are best handled in a bundled
manner (e.g. think of the two fds of a pipe). I really don't see how you
might handle all of that with your suggested scheme.


5. Your suggestion leaves too many details out. Yes, it's a call for
discussion. But still. Zap, OpenVZ and other systems build on experience
and working code. We know how to do incremental, live, and other goodies.
I'm not sure how these would work with your scheme.


6. Performance: in one important use case I checkpoint the entire user
desktop once a second, with downtime (due to checkpoint) of < 15ms for
even busy configurations and large memory footprints. While syscalls are
relatively cheap, I wonder if your approach could keep up with it.


Oren.

Dave Hansen wrote:
> I hate the syscall.  It's a very un-Linux-y way of doing things.  There,
> I said it.  Here's an alternative.  It still uses the syscall to
> initiate things, but it uses debugfs to transport the data instead.
> This is just a concept demonstration.  It doesn't actually work, and I
> wouldn't be using debugfs in practice.
> 
> System calls in Linux are fast.  Doing lots of them is not a problem.
> If it becomes one, we can always export a condensed version of this
> format next to the expanded one, kinda like ftrace does.  Atomicity with
> this approach is also not a problem.  The system call in this approach
> doesn't return until the checkpoint is completely written out.
> 
> This lets userspace pick and choose what parts of the checkpoint it
> cares about.  It enables us to do all the I/O from userspace: no
> in-kernel sys_read/write().  I think this interface is much more
> flexible than a plain syscall.
> 
> Want to do a fast checkpoint?  Fine, copy all data, use a lot of memory,
> store it in-kernel.  Dump that out when the filesystem is accessed.
> Destroy it when userspace asks.
> 
> Want to do a checkpoint with a small memory footprint?
> 10 write one struct
> 20 wait for userspace
> 30 goto 10
> 
> Userspace can loop like it is reading a pipe.  We could even track
> per-checkpoint memory usage in the cr_ctx and stop writing when we go
> over a certain memory threshold.
> 
> We can have two modes, internally.  Userspace never has to know
> which one we've chosen.  Say we have a word of data to output.  We can
> either make a copy at sys_checkpoint() time and let the data continue to
> be modified (let the task run).  Or, we can keep the task frozen and
> generate data at debugfs read() time.  This means potentially zero
> copying of data until userspace wants it.
> 
> The same goes for structures which might have complicated locking or
> lifetime rules.  
> 
> This also shows how we might handle shared objects.
> 
> To use, just sys_checkpoint() as before, and look at /sys/kernel/debug/.
> Use the crid you got back from the syscall to locate your checkpoint.
> Write into the 'done' file when you want the sys_checkpoint() to return.
> 
> /sys/kernel/debug/checkpoint-1/
> /sys/kernel/debug/checkpoint-1/done
> /sys/kernel/debug/checkpoint-1/task-1141
> /sys/kernel/debug/checkpoint-1/task-1141/fds
> /sys/kernel/debug/checkpoint-1/task-1141/fds/1
> /sys/kernel/debug/checkpoint-1/task-1141/fds/1/coe
> /sys/kernel/debug/checkpoint-1/task-1141/fds/1/fd_nr
> /sys/kernel/debug/checkpoint-1/task-1141/fds/1/fd
> /sys/kernel/debug/checkpoint-1/task-1141/fds/0
> /sys/kernel/debug/checkpoint-1/task-1141/fds/0/coe
> /sys/kernel/debug/checkpoint-1/task-1141/fds/0/fd_nr
> /sys/kernel/debug/checkpoint-1/task-1141/fds/0/fd
> /sys/kernel/debug/checkpoint-1/files
> /sys/kernel/debug/checkpoint-1/files/2
> /sys/kernel/debug/checkpoint-1/files/2/f_version
> /sys/kernel/debug/checkpoint-1/files/2/f_pos
> /sys/kernel/debug/checkpoint-1/files/2/f_mode
> /sys/kernel/debug/checkpoint-1/files/2/f_flags
> /sys/kernel/debug/checkpoint-1/files/1
> /sys/kernel/debug/checkpoint-1/files/1/target
> /sys/kernel/debug/checkpoint-1/files/1/fd_type
> /sys/kernel/debug/checkpoint-1/files/1/f_version
> /sys/kernel/debug/checkpoint-1/files/1/f_pos
> /sys/kernel/debug/checkpoint-1/files/1/f_mode
> /sys/kernel/debug/checkpoint-1/files/1/f_flags
> 
> So, why not?
> 
> -- Dave
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]   ` <4909FAA8.5000107-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2008-10-30 19:28     ` Serge E. Hallyn
       [not found]       ` <20081030192817.GA16340-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-10-30 19:37     ` Dave Hansen
  1 sibling, 1 reply; 26+ messages in thread
From: Serge E. Hallyn @ 2008-10-30 19:28 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> 
> I'm not sure why you say it's "un-linux-y" to begin with. But to the

The thing that is un-linux-y is specifically having user-space pass an
fd to the kernel from which it reads/writes.  LSMs had to go to a lot of
pain to avoid doing that for reading policy configuration at boot.

Of course it's now several years later, and moods and tastes change in
the kernel community, but I suspect it's still frowned upon.

> point, here are my thoughts:
> 
> 
> 1. What you suggest is to expose the internal data to user space and
> pull it. Isn't that what cryo tried to do ?  And the conclusion was
> that it takes too many interfaces to work out, code in, provide, and
> maintain forever, with issues related to backward compatibility and
> what not. In fact, the conclusion was "let's do a kernel-blob" !

Right, the problem with cryo was that it tried to do the checkpoint and
restart themselves at too fine-grained a level in terms of kernel-user
API.

What Dave is suggesting (as I understand it) is just changing the way
the data is shipped between kernel and user-space.  But to continue with
sys_checkpoint() and sys_restart().  So I think it's a less fundamental
change than you are thinking.

Now maybe eventually he's going to propose something more esoteric where
doing the mount() actually starts the checkpoint (that's where I figured
he'd be heading), but I think it would still be one action on the part
of userspace telling the kernel "do a checkpoint".

(Or am I wrong on that, Dave?)

[...]

(I'll let Dave respond to your other questions i.e. about what you gain)

> If this is only to be able to parallelize checkpoint - then let's discuss
> the problem, not a specific solution.

The specific problem is that you have userspace pass a file fd to the
kernel and kernel reading/writing to it, which is un-linuxy.

> > It enables us to do all the I/O from userspace: no in-kernel
> > sys_read/write().
> 
> What's so wrong with in-kernel vfs_read/write() ?  You mentioned deadlocks,

It's un-linux-y :)

[...]

> 5. Your suggestion leaves too many details out. Yes, it's a call for
> discussion. But still. Zap, OpenVZ and other systems build on experience
> and working code. We know how to do incremental, live, and other goodies.
> I'm not sure how these would work with your scheme.

Not sure what problems you envision, but taking the specific example of
pre-dump to prepare for a quick live migration, I could envision a
pre_checkpoint() system call creating the checkpoint data directory
and starting to dump out the data, and starting to copy that data
over the network (optimistically), after which the do_checkpoint()
syscall checks file timestamps and quickly dumps and network-copies the
data which has changed up until the container was frozen.
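
The timestamp check could look roughly like this from userspace.  It is
only a sketch: ckpt_needs_recopy() is a hypothetical helper, and it assumes
the checkpoint files carry a meaningful mtime, which nothing in this
thread guarantees.

```c
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Decide whether a file dumped during the pre-checkpoint pass has to be
 * copied again.  Returns 1 if it was modified at or after predump_time,
 * 0 if it is unchanged, and -1 if it cannot be stat()ed.
 *
 * ">=" is deliberate: with one-second granularity, a change made in the
 * same second as the pre-dump would otherwise be missed.
 */
int ckpt_needs_recopy(const char *path, time_t predump_time)
{
	struct stat st;

	assert(path);
	if (stat(path, &st) != 0)
		return -1;
	return st.st_mtime >= predump_time ? 1 : 0;
}
```

The final pass after the freeze would then walk the checkpoint directory
and ship only the files for which this returns 1.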

-serge

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]   ` <4909FAA8.5000107-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-10-30 19:28     ` Serge E. Hallyn
@ 2008-10-30 19:37     ` Dave Hansen
  2008-10-30 20:15       ` Oren Laadan
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2008-10-30 19:37 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers

On Thu, 2008-10-30 at 14:19 -0400, Oren Laadan wrote:
> I'm not sure why you say it's "un-linux-y" to begin with. But to the
> point, here are my thoughts:
> 
> 1. What you suggest is to expose the internal data to user space and
> pull it. Isn't that what cryo tried to do ?

No, cryo attempted to use existing kernel interfaces when they exist,
and create new ones in different places one at a time.

> And the conclusion was
> that it takes too many interfaces to work out, code in, provide, and
> maintain forever, with issues related to backward compatibility and
> what not.

You may have concluded that. :)

> In fact, the conclusion was "let's do a kernel-blob" !

This is a blob.  It's simply a blob exported in a filesystem.  Note that
it exports the same format as the 'big blob' with the same types.  Stick
a couple of cr_hdr* objects on to what we have in the filesystem, and we
get the same blob that we have now.

How would a tarball of this filesystem be any less of a blob than the
output from sys_checkpoint() is now?
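
A sketch of that "stick a header on it" step, in plain userspace C.
struct blob_hdr copies only the three original fields of struct cr_hdr
(type, len, parent) from the patch; blob_append() and the type value used
in any caller are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The three original fields of struct cr_hdr, with fixed-width types. */
struct blob_hdr {
	int16_t type;
	int16_t len;
	uint32_t parent;
};

/*
 * Append one record (header, then payload) to buf at offset *pos and
 * advance *pos.  Returns 0 on success, -1 if the record does not fit.
 */
int blob_append(unsigned char *buf, size_t cap, size_t *pos,
		int16_t type, uint32_t parent,
		const void *payload, int16_t len)
{
	struct blob_hdr h = { .type = type, .len = len, .parent = parent };
	size_t need;

	assert(buf && pos);
	if (len < 0 || (len > 0 && !payload))
		return -1;
	need = sizeof(h) + (size_t)len;
	if (*pos > cap || need > cap - *pos)
		return -1;
	memcpy(buf + *pos, &h, sizeof(h));
	if (len > 0)
		memcpy(buf + *pos + sizeof(h), payload, (size_t)len);
	*pos += need;
	return 0;
}
```

Walking the directory tree and calling blob_append() once per file, with
the file contents as the payload, would yield one flat stream again.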

> 2. So there is a high price tag for the extra flexibility - more code,
> more complexity, more maintenance nightmare, more API fights. But the
> real question IMHO is what do you gain from it ?

I think I've shown here that it can be done in a tremendously small
amount of code.  There are no more API fights than what we would have
now for each additional type of 'struct cr_something' that the syscall
would spit out.

> > This lets userspace pick and choose what parts of the checkpoint it
> > cares about.
> 
> So what ?  Why do you ever need that ?

The simplest example would be checkpointing 'cat > some_file'.  Perhaps
the restorer doesn't want to write to some_file.  The important thing to
them is to get the stdout and not redirect it.  This gets down to the
"what fds do you checkpoint" problem.  We've discussed this, and your
approach is to add another kernel interface which flags fds before the
checkpoint.  Right?  This would obviate the need for such an interface
inside the kernel.

> If this is only to be able to parallelize checkpoint - then let's discuss
> the problem, not a specific solution.

This approach parallelizes naturally.  There's no additional code in the
kernel to handle it.  It certainly isn't the only reason, though.

> > It enables us to do all the I/O from userspace: no in-kernel
> > sys_read/write().
> 
> What's so wrong with in-kernel vfs_read/write() ?  You mentioned deadlocks,
> but I'm yet to see one and understand the problem. My experience with Zap
> (and Andrey's with OpenVZ) has been pretty good.
> 
> If eventually this becomes the main issue, we can discuss alternatives
> (some have been proposed in the past) and again, fit a solution to the
> problem as opposed to fit a problem to a solution.

As Andrew said, this is a very unconventional way of doing things.  My
approach is certainly more conventional, and proved to work.  We should
have very, very good reasons for departing from what we know to work. 


> 3. Your approach doesn't play well with what I call "checkpoint that
> involves self". This term refers to a process that checkpoints itself
> (and only itself), or to a process that attempts to checkpoint its own
> container.  In both cases, there is no other entity that will read the
> data from the file system while the caller is blocked.

I would propose an in-userspace solution for this issue.  If a process
wants to checkpoint itself, it must first fork and let the forked
process do the checkpoint.

In practice, I expect self-checkpoint to be a very small minority of the
use of this feature.  Applications smart enough to self-checkpoint are
probably smart enough not to need to. 

> 4. I'm not sure how you want to handle shared objects. Simply saying:
> 
> > This also shows how we might handle shared objects.
> 
> isn't quite convincing. Keep in mind that sharing is determined in kernel,
> and in the order that objects are encountered (as they should only be
> dumped once). There may be objects that are shared, and themselves refer
> to objects that are shared, and such objects are best handled in a bundled
> manner (e.g. think of the two fds of a pipe). I really don't see how you
> might handle all of that with your suggested scheme.

In all fairness, what you posted doesn't show pipes, either. :)

But, in your approach, you would be reading from the 'struct
cr_hdr_files' and you would see a pipe fd along with its identifier in
the cr_hdr_fd_ent->objref field.  You would do a lookup in the hash
table on that objref and either return a pipe if one is there, or create
a new one if the other end hasn't been seen yet.  Right?

All we need to export with my scheme is the inode nr in the pipe
filesystem and the fact that the pipe is a pipe.  In other words, create
something like this:

/sys/kernel/debug/checkpoint-1/files/2/f_isapipe
/sys/kernel/debug/checkpoint-1/files/2/f_inode_nr

Just substitute whatever flags or things you would have used inside
'cr_hdr_fd_ent' to denote the presence of a pipe.  This could use the
same.

If we were doing a configfs-style restart, the restarter would simply
restore those two files.  The act of doing open(O_CREAT) is the same
trigger as what you have now when a cr_hdr of some type is encountered.
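
A minimal sketch of that restart-side lookup, in Python, for
illustration only.  ObjTable and pipe_end() are made-up names; the key
would be the objref from the blob format, or the f_inode_nr value in
the filesystem format:

```python
import os

class ObjTable:
    """Maps checkpoint object references to objects rebuilt at restart."""

    def __init__(self):
        self._objs = {}

    def pipe_end(self, objref, want_write):
        """Return the fd for one end of the pipe identified by objref.

        The first lookup creates the pipe; later lookups reuse it, so
        two fd entries carrying the same objref land on the same pipe.
        """
        if objref not in self._objs:
            self._objs[objref] = os.pipe()   # (read_fd, write_fd)
        read_fd, write_fd = self._objs[objref]
        return write_fd if want_write else read_fd
```

Whichever end is encountered first creates the pipe, so the order in
which the two fds show up in the checkpoint does not matter.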

> 5. Your suggestion leaves too many details out. Yes, it's a call for
> discussion. But still. Zap, OpenVZ and other systems build on experience
> and working code. We know how to do incremental, live, and other goodies.
> I'm not sure how these would work with your scheme.

Well, we haven't even gotten to memory, yet.  For incremental and live,
virtually all the data is memory contents, right?

I understand this is *different* from what you're using, and that
reduces your confidence in it.  That's unavoidable.  But, can you share
your insight into incremental and live checkpointing to point out things
which conflict with this approach?

> 6. Performance: in one important use case I checkpoint the entire user
> desktop once a second, with downtime (due to checkpoint) of < 15ms for
> even busy configurations and large memory footprint. While syscalls are
> relatively cheap, I wonder if your approach could keep up with it.

Again, I think this all comes down to how we do memory.  If we have one
file per byte of memory, I think we'll see syscall overhead.  All of the
other data that gets transferred is going to be teeny compared to
memory.
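
To put rough numbers on that, here is a small Python calculation of how
many read() calls it takes to pull a memory image at different
granularities.  The 1 GiB image size and the chunk sizes are arbitrary
assumptions, and per-file open()/close() overhead is ignored:

```python
def reads_needed(total_bytes, chunk_bytes):
    """Number of read() calls to transfer total_bytes in chunk_bytes pieces."""
    return -(-total_bytes // chunk_bytes)  # ceiling division

GIB = 1 << 30
for label, chunk in (("byte", 1), ("4 KiB page", 4096), ("1 MiB chunk", 1 << 20)):
    print(f"one read per {label}: {reads_needed(GIB, chunk)} calls")
```

The call count falls in proportion to the chunk size, which is why the
granularity at which memory is exported matters far more than the
handful of small files describing everything else.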

-- Dave

* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]       ` <20081030192817.GA16340-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-10-30 19:39         ` Dave Hansen
  2008-10-30 19:50           ` Serge E. Hallyn
  2008-10-30 19:47         ` Oren Laadan
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2008-10-30 19:39 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: containers

On Thu, 2008-10-30 at 14:28 -0500, Serge E. Hallyn wrote:
> Now maybe eventually he's going to propose something more esoteric where
> doing the mount() actually starts the checkpoint (that's where I figured
> he'd be heading), but I think it would still be one action on the part
> of userspace telling the kernel "do a checkpoint".
> 
> (Or am I wrong on that, Dave?)

I don't really care how it is initiated.  If a checkpoint was initiated
by sys_mount() with special mount options, I don't see a real
distinction between that and sys_checkpoint().  Or, a special ioctl() on
a special device file for that matter.  How we initiate it isn't
important to me.

-- Dave

* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]       ` <20081030192817.GA16340-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-10-30 19:39         ` Dave Hansen
@ 2008-10-30 19:47         ` Oren Laadan
       [not found]           ` <490A0F67.5000303-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  1 sibling, 1 reply; 26+ messages in thread
From: Oren Laadan @ 2008-10-30 19:47 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: containers, Dave Hansen



Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
>> I'm not sure why you say it's "un-linux-y" to begin with. But to the
> 
> The thing that is un-linux-y is specifically having user-space pass an
> fd to the kernel from which it reads/writes.  LSMs had to go to a lot of
> pain to avoid doing that for reading policy configuration at boot.
> 
> Of course it's now several years later, and moods and tastes change in
> the kernel community, but I suspect it's still frowned upon.
> 
>> point, here are my thoughts:
>>
>>
>> 1. What you suggest is to expose the internal data to user space and
>> pull it. Isn't that what cryo tried to do ?  And the conclusion was
>> that it takes too many interfaces to work out, code in, provide, and
>> maintain forever, with issues related to backward compatibility and
>> what not. In fact, the conclusion was "let's do a kernel-blob" !
> 
> Right, the problem with cryo was that it tried to do the checkpoint and
> restart themselves at too fine-grained a level in terms of kernel-user
> API.
> 
> What Dave is suggesting (as I understand it) is just changing the way
> the data is shipped between kernel and user-space.  But to continue with
> sys_checkpoint() and sys_restart().  So I think it's a less fundamental
> change than you are thinking.

Probably true, if you ignore the tree he used to illustrate the idea :o
If we agree on the 'blob' (or nearly 'blob') approach, he should suggest
to export a single file (or one file per task, but that's _it_).

> 
> Now maybe eventually he's going to propose something more esoteric where
> doing the mount() actually starts the checkpoint (that's where I figured
> he'd be heading), but I think it would still be one action on the part
> of userspace telling the kernel "do a checkpoint".

Can you comment on point 3, that is --

  3. Your approach doesn't play well with what I call "checkpoint that
  involves self". This term refers to a process that checkpoints itself
  (and only itself), or to a process that attempts to checkpoint its own
  container.  In both cases, there is no other entity that will read the
  data from the file system while the caller is blocked.

This is a key point for me, with multiple use cases. The simplest, if
you will, is for a process to simply checkpoint itself (no containers
and other crap :p). Same for dumping your own container. And there are
others.

In fact, the question is whether checkpoint is push-based or pull-based.
Push-based is what we have now - the kernel pushes data to the fd. Dave
suggests a pull-based approach, where the kernel generates data (ahead
of time or on-demand) in response to the user reading it.

My preference for a push-based approach is based on simplicity (see the
code now), point 3 above, and my experience with optimizations such as
incremental checkpoints, pre-dump and post-dump optimizations.

That said, it is possible (but ends up with more complex code) to
convert a push-based approach to a pull-based one. Given that I personally
think push-based is easier, and I don't want to give up point 3, I'd
say we should proceed as is, and we can always change back (or support
both) later.
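
A toy illustration of the two styles, in Python.  The record names are
invented and this is not how the patchset is structured; it only shows
where the loop lives in each case:

```python
import io

# Hypothetical record stream; real checkpoint records look nothing like this.
RECORDS = [b"hdr", b"task", b"mm", b"files"]

def push_checkpoint(out):
    """Push: the producer drives the loop and writes every record to
    the destination it was handed, then returns."""
    for rec in RECORDS:
        out.write(rec)

def pull_checkpoint():
    """Pull: the producer is suspended between records and only
    advances when the reader asks for the next one."""
    for rec in RECORDS:
        yield rec

buf = io.BytesIO()
push_checkpoint(buf)           # one call, all data written
reader = pull_checkpoint()
first = next(reader)           # producer state must persist between reads
```

In the pull case the producer has to keep its position (and everything
it depends on) alive between reads, which is the extra context that has
to be maintained.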

> 
> (Or am I wrong on that, Dave?)
> 
> [...]
> 
> (I'll let Dave respond to your other questions i.e. about what you gain)
> 
>> If this is only to be able to parallelize checkpoint - then let's discuss
>> the problem, not a specific solution.
> 
> The specific problem is that you have userspace pass a file fd to the
> kernel and kernel reading/writing to it, which is un-linuxy.
> 
>>> It enables us to do all the I/O from userspace: no in-kernel
>>> sys_read/write().
>> What's so wrong with in-kernel vfs_read/write() ?  You mentioned deadlocks,
> 
> It's un-linux-y :)
> 
> [...]
> 
>> 5. Your suggestion leaves too many details out. Yes, it's a call for
>> discussion. But still. Zap, OpenVZ and other systems build on experience
>> and working code. We know how to do incremental, live, and other goodies.
>> I'm not sure how these would work with your scheme.
> 
> Not sure what problems you envision, but taking the specific example of
> pre-dump to prepare for a quick live migration, I could envision a
> pre_checkpoint() system call creating the checkpoint data directory
> and starting to dump out the data, and starting to copy that data
> over the network (optimistically), after which the do_checkpoint()
> syscall checks file timestamps and quickly dumps and network-copies the
> data which has changed up until the container was frozen.

I don't envision anything.

But, having not-envisioned a few times in the past and then having
eaten-%$^% because of that, I ask myself if the actual implementation
will really turn out to be as simple as writing an idea on a terminal.

The above scheme sounds simple, but is far more complicated than one
can imagine. There are races and many things to track in-kernel while
all this pre-copy takes place. I never implemented it the way Dave
suggests, so I - or you, or him - don't know the implications. Point
is, the burden of proof is on him.

Oren.

* Re: [BIG RFC] Filesystem-based checkpoint
  2008-10-30 19:39         ` Dave Hansen
@ 2008-10-30 19:50           ` Serge E. Hallyn
  0 siblings, 0 replies; 26+ messages in thread
From: Serge E. Hallyn @ 2008-10-30 19:50 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers

Quoting Dave Hansen (dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> On Thu, 2008-10-30 at 14:28 -0500, Serge E. Hallyn wrote:
> > Now maybe eventually he's going to propose something more esoteric where
> > doing the mount() actually starts the checkpoint (that's where I figured
> > he'd be heading), but I think it would still be one action on the part
> > of userspace telling the kernel "do a checkpoint".
> > 
> > (Or am I wrong on that, Dave?)
> 
> I don't really care how it is initiated.  If a checkpoint was initiated
> by sys_mount() with special mount options, I don't see a real
> distinction between that and sys_checkpoint().  Or, a special ioctl() on
> a special device file for that matter.  How we initiate it isn't
> important to me.

Ok, that's what I thought.

-serge

* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]           ` <490A0F67.5000303-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2008-10-30 20:03             ` Serge E. Hallyn
  2008-10-30 20:11             ` Dave Hansen
  1 sibling, 0 replies; 26+ messages in thread
From: Serge E. Hallyn @ 2008-10-30 20:03 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers, Dave Hansen

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> 
> 
> Serge E. Hallyn wrote:
> > Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> > What Dave is suggesting (as I understand it) is just changing the way
> > the data is shipped between kernel and user-space.  But to continue with
> > sys_checkpoint() and sys_restart().  So I think it's a less fundamental
> > change than you are thinking.
> 
> Probably true, if you ignore the tree he used to illustrate the idea :o
> If we agree on the 'blob' (or nearly 'blob') approach, he should suggest
> to export a single file (or one file per task, but that's _it_).

Well no.  I'm saying that the problem with cryo was that you had to use
tons of different APIs - and introduce a few new ones - to grab info
about various resources using their own API.  Yes Dave looks to be
making you grab all the info in fine-grained pieces through individual
files, but each bit of info is consumed using the exact same API.

> Can you comment on point 3, that is --
> 
>   3. Your approach doesn't play well with what I call "checkpoint that
>   involves self". This term refers to a process that checkpoints itself
>   (and only itself), or to a process that attempts to checkpoint its own
>   container.  In both cases, there is no other entity that will read the
>   data from the file system while the caller is blocked.

This is where I seem to recall Dave mentioning some crazy scheme where
the task would clone itself and have its clone do the pulling.

Don't get me wrong - I'm not sure what Dave's intentions are, but I
agree with you that we should keep working on pushing your patchset.
If we get a nack based on the set_fs() stuff then we know to go with
Dave's approach (or some other), otherwise Dave can keep pursuing his
idea in his sandbox.  But I do think his idea is cool.  Not all cool
ideas end up being workable, so we'll see...

/me now goes to try lxc-checkpoint with Oren's patches.

-serge

* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]           ` <490A0F67.5000303-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-10-30 20:03             ` Serge E. Hallyn
@ 2008-10-30 20:11             ` Dave Hansen
  2008-11-04 21:33               ` Mike Waychison
  1 sibling, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2008-10-30 20:11 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers

On Thu, 2008-10-30 at 15:47 -0400, Oren Laadan wrote:
>   3. Your approach doesn't play well with what I call "checkpoint that
>   involves self". This term refers to a process that checkpoints itself
>   (and only itself), or to a process that attempts to checkpoint its own
>   container.  In both cases, there is no other entity that will read the
>   data from the file system while the caller is blocked.
> 
> This is a key point for me, with multiple use cases. The simplest, if
> you will, is for a process to simply checkpoint itself (no containers
> and other crap :p). Same for dumping your own container. And there are
> others.

Let's take a step back here.  I believe that strictly enforcing this
requirement means that the checkpoint must be done in its entirety by
the kernel.

A process must have its state serialized in a repeatable way.  That
basically precludes it running during the checkpoint, or having its
state change in any way that isn't atomic.

If a process can't be, itself, running during a checkpoint, then
something running must be performing the checkpoint.  That "something"
must either be another process or the kernel.  Since you've defined the
goal as a self-checkpoint, it *can't* be another process.  So, it *must*
be the kernel.

When it comes down to it, I think this point drives quite a bit of the
implementation.  The cr_kread/write(), for instance.  We *need* the
kernel to do the writing since we've completely precluded userspace from
doing it.

-- Dave

* Re: [BIG RFC] Filesystem-based checkpoint
  2008-10-30 19:37     ` Dave Hansen
@ 2008-10-30 20:15       ` Oren Laadan
       [not found]         ` <490A15F5.6010702-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Oren Laadan @ 2008-10-30 20:15 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers



Dave Hansen wrote:
> On Thu, 2008-10-30 at 14:19 -0400, Oren Laadan wrote:
>> I'm not sure why you say it's "un-linux-y" to begin with. But to the
>> point, here are my thoughts:
>>
>> 1. What you suggest is to expose the internal data to user space and
>> pull it. Isn't that what cryo tried to do ?
> 
> No, cryo attempted to use existing kernel interfaces when they exist,
> and create new ones in different places one at a time.
> 
>> And the conclusion was
>> that it takes too many interfaces to work out, code in, provide, and
>> maintain forever, with issues related to backward compatibility and
>> what not.
> 
> You may have concluded that. :)

You may have been the only one who didn't conclude that. :)

> 
>> In fact, the conclusion was "let's do a kernel-blob" !
> 
> This is a blob.  It's simply a blob exported in a filesystem.  Note that
> it exports the same format as the 'big blob' with the same types.  Stick
> a couple of cr_hdr* objects on to what we have in the filesystem, and we
> get the same blob that we have now.
> 
> How would a tarball of this filesystem be any less of a blob than the
> output from sys_checkpoint() is now?

It isn't a blob per se - it exposes the structure via the file system;
tomorrow someone will write a program that relies on that structure, and
the next time you wanna change something you open a can of worms.

How likely is this to happen if you used, for instance, a single file in
your file system approach ?

> 
>> 2. So there is a high price tag for the extra flexibility - more code,
>> more complexity, more maintenance nightmare, more API fights. But the
>> real question IMHO is what do you gain from it ?
> 
> I think I've shown here that it can be done in a tremendously small
> amount of code.  There are no more API fights than what we would have
> now for each additional type of 'struct cr_something' that the syscall
> would spit out.

Sure, exporting a file system is relatively small code. The problem is
that it makes the logic of the checkpoint more complex (see my pull-based
vs push-based post). Maintaining the context is more involved.

> 
>>> This lets userspace pick and choose what parts of the checkpoint it
>>> cares about.
>> So what ?  Why do you ever need that ?
> 
> The simplest example would be checkpointing 'cat > some_file'.  Perhaps
> the restorer doesn't want to write to some_file.  The important thing to
> them is to get the stdout and not redirect it.  This gets down to the
> "what fds do you checkpoint" problem.  We've discussed this, and your
> approach is to add another kernel interface which flags fds before the

Nope. That wasn't what I said.

I suggested that user space will have a mechanism to exclude certain
resources, for performance reasons (e.g. madvise() for memory regions
that it doesn't want saved, because they are scratch).

I also suggested that user space will modify (filter) the checkpoint
image if it wants resources redirected or anything.

And I also suggested (envisioning, for instance, distributed checkpoint
and building on how Zap does it) that user space will have the option
to tell the kernel, before _restart_ to use a given resource for some
specified resource from the checkpoint image (e.g. use a newly created
socket connection and substitute it for whatever was saved as fd#6 of
task 981, or what not).

> checkpoint.  Right?  This would obviate the need for such an interface
> inside the kernel.

The interface would have to sit somewhere, because it is the application
that decides and declares which resources aren't "important" (my first
suggestion above), and it is generally another process that performs
the checkpoint. How would that other process know which resources are
unimportant for the process, or which resources to change ? They would
need to communicate some other way, no ?

And again, what about self-induced checkpoint ?

> 
>> If this is only to be able to parallelize checkpoint - then let's discuss
>> the problem, not a specific solution.
> 
> This approach parallelizes naturally.  There's no additional code in the
> kernel to handle it.  It certainly isn't the only reason, though.

With one caveat: shared resources - which must be handled from within the
kernel - are therefore not that trivial to handle in user space.

> 
>>> It enables us to do all the I/O from userspace: no in-kernel
>>> sys_read/write().
>> What's so wrong with in-kernel vfs_read/write() ?  You mentioned deadlocks,
>> but I'm yet to see one and understand the problem. My experience with Zap
>> (and Andrey's with OpenVZ) has been pretty good.
>>
>> If eventually this becomes the main issue, we can discuss alternatives
>> (some have been proposed in the past) and again, fit a solution to the
>> problem as opposed to fit a problem to a solution.
> 
> As Andrew said, this is a very unconventional way of doing things.  My
> approach is certainly more conventional, and proved to work.  We should
> have very, very good reasons for departing from what we know to work. 
> 
> 
>> 3. Your approach doesn't play well with what I call "checkpoint that
>> involves self". This term refers to a process that checkpoints itself
>> (and only itself), or to a process that attempts to checkpoint its own
>> container.  In both cases, there is no other entity that will read the
>> data from the file system while the caller is blocked.
> 
> I would propose an in-userspace solution for this issue.  If a process
> wants to checkpoint itself, it must first fork and let the forked
> process do the checkpoint.

That's actually not a bad idea, and the actual work could in many cases
be hidden in a library call.

> 
> In practice, I expect self-checkpoint to be a very small minority of the
> use of this feature.  Applications smart enough to self-checkpoint are
> probably smart enough not to need to. 

On the contrary. Many applications are dumb enough to use simple user
space based c/r libraries. Especially HPC, btw. I actually expect many
users to pick this capability, if it's there for free.

In any case, the self-checkpoint you suggest may work well for a single
process, but not quite so for checkpointing your own container. And that
is a very useful feature.

> 
>> 4. I'm not sure how you want to handle shared objects. Simply saying:
>>
>>> This also shows how we might handle shared objects.
>> isn't quite convincing. Keep in mind that sharing is determined in kernel,
>> and in the order that objects are encountered (as they should only be
>> dumped once). There may be objects that are shared, and themselves refer
>> to objects that are shared, and such objects are best handled in a bundled
>> manner (e.g. think of the two fds of a pipe). I really don't see how you
>> might handle all of that with your suggested scheme.
> 
> In all fairness, what you posted doesn't show pipes, either. :)
> 
> But, in your approach, you would be reading from the 'struct
> cr_hdr_files' and you would see a pipe fd along with its identifier in
> the cr_hdr_fd_ent->objref field.  You would do a lookup in the hash
> table on that objref and either return a pipe if one is there, or create
> a new one if the other end hasn't been seen yet.  Right?
> 
> All we need to export with my scheme is the inode nr in the pipe
> filesystem and the fact that the pipe is a pipe.  In other words, create
> something like this:
> 
> /sys/kernel/debug/checkpoint-1/files/2/f_isapipe
> /sys/kernel/debug/checkpoint-1/files/2/f_inode_nr

A very detailed blob indeed; I bet its 5K pages spec book has some
of those "this space intentionally left blank" pages... :p

> 
> Just substitute whatever flags or things you would have used inside
> 'cr_hdr_fd_ent' to denote the presence of a pipe.  This could use the
> same.
> 
> If we were doing a configfs-style restart, the restarter would simply
> restore those two files.  The act of doing open(O_CREAT) is the same
> trigger as what you have now when a cr_hdr of some type is encountered.

What you did not address in your response, is that the thing with shared
resources is that they appear more than once. In your terminology, they
would show up in multiple places in the tree. Then they would be saved
multiple times ?

> 
>> 5. Your suggestion leaves too many details out. Yes, it's a call for
>> discussion. But still. Zap, OpenVZ and other systems build on experience
>> and working code. We know how to do incremental, live, and other goodies.
>> I'm not sure how these would work with your scheme.
> 
> Well, we haven't even gotten to memory, yet.  For incremental and live,
> virtually all the data is memory contents, right?
> 
> I understand this is *different* from what you're using, and that
> reduces your confidence in it.  That's unavoidable.  But, can you share
> your insight into incremental and live checkpointing to point out things
> which conflict with this approach?
> 
>> 6. Performance: in one important use case I checkpoint the entire user
>> desktop once a second, with downtime (due to checkpoint) of < 15ms for
>> even busy configurations and large memory footprint. While syscalls are
>> relatively cheap, I wonder if your approach could keep up with it.
> 
> Again, I think this all comes down to how we do memory.  If we have one
> file per byte of memory, I think we'll see syscall overhead.  All of the
> other data that gets transferred is going to be teeny compared to
> memory.
> 
> -- Dave
> 

Don't get me wrong: I think the idea is very neat and I've said it before.
I just don't think it's the best fit for our purposes.

I wonder what the others think ?

Oren.

* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]         ` <490A15F5.6010702-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2008-10-30 20:40           ` Dave Hansen
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Hansen @ 2008-10-30 20:40 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers

On Thu, 2008-10-30 at 16:15 -0400, Oren Laadan wrote:
> Dave Hansen wrote:
> > This is a blob.  It's simply a blob exported in a filesystem.  Note that
> > it exports the same format as the 'big blob' with the same types.  Stick
> > a couple of cr_hdr* objects on to what we have in the filesystem, and we
> > get the same blob that we have now.
> > 
> > How would a tarball of this filesystem be any less of a blob than the
> > output from sys_checkpoint() is now?
> 
> It isn't a blob per se - it exposes the structure via the file system;
> tomorrow someone will write a program that relies on that structure, and
> the next time you wanna change something you open a can of worms.

This is an ABI that I'm proposing.  But, so is the blob from the
syscall.  Are you saying that people can't write programs that depend on
the structure of the data returned from the syscall?  

> How likely is this to happen if you used, for instance, a single file in
> your file system approach ?

Definite.  Just as people will write programs to access only parts of
the resultant sys_checkpoint() files.

> > If we were doing a configfs-style restart, the restarter would simply
> > restore those two files.  The act of doing open(O_CREAT) is the same
> > trigger as what you have now when a cr_hdr of some type is encountered.
> 
> What you did not address in your response, is that the thing with shared
> resources is that they appear more than once. In your terminology, they
> would show up in multiple places in the tree. Then they would be saved
> multiple times ?

*References* will show up more than once.  But, filesystems handle
references today with symlinks or hard links.  We could either do that
or force userspace to do the duplicate and sharing detection itself.

You could get nice and creative here.  For instance, look at the link
count on a file.  If it is 1, write out the record for that resource
into the checkpoint file.  If it is >1, then write out a reference and
unlink the file.  
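
A userspace sketch of that link-count walk, in Python, for illustration
only.  It assumes the checkpoint filesystem exposes a shared resource
as several hard links to one file, which is an assumption about a
design that does not exist yet; dump_tree() and the tuple layout are
made up:

```python
import os

def dump_tree(root):
    """Walk root and emit one full record per resource, plus a
    reference for every additional hard link to it.

    Extra links are unlinked as they are visited, so the last link
    encountered has st_nlink == 1 and carries the actual record.
    """
    out = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # deterministic traversal
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            st = os.stat(path)
            if st.st_nlink > 1:
                # Shared and not yet the last link: note a reference.
                out.append(("ref", rel, st.st_ino))
                os.unlink(path)
            else:
                with open(path, "rb") as f:
                    out.append(("record", rel, st.st_ino, f.read()))
    return out
```

The inode number ties each reference to the record that eventually
gets written, so the restarter can match them up regardless of the
order in which the links were visited.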

-- Dave

* Re: [BIG RFC] Filesystem-based checkpoint
  2008-10-28 18:37 [BIG RFC] Filesystem-based checkpoint Dave Hansen
  2008-10-28 20:56 ` Serge E. Hallyn
  2008-10-30 18:19 ` Oren Laadan
@ 2008-10-30 23:33 ` Eric W. Biederman
       [not found]   ` <m163n9y7yb.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  2 siblings, 1 reply; 26+ messages in thread
From: Eric W. Biederman @ 2008-10-30 23:33 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers

Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> I hate the syscall.  It's a very un-Linux-y way of doing things.  There,
> I said it.  Here's an alternative.  It still uses the syscall to
> initiate things, but it uses debugfs to transport the data instead.
> This is just a concept demonstration.  It doesn't actually work, and I
> wouldn't be using debugfs in practice.

A syscall is a very linux-y way to do it.

If you called it a core dump instead of a checkpoint you have exactly the same set
of issues.

Why we are doing vfs_write instead of file->f_op->write I don't understand.

> System calls in Linux are fast.  Doing lots of them is not a problem.
> If it becomes one, we can always export a condensed version of this
> format next to the expanded one, kinda like ftrace does.  Atomicity with
> this approach is also not a problem.  The system call in this approach
> doesn't return until the checkpoint is completely written out.

Extra copies of something (memory) you want to transfer quickly
and efficiently are a problem.

Reading the memory of another process is a problem, to the point
that the /proc/<pid>/mem interface has been removed from the kernel.
  
> This lets userspace pick and choose what parts of the checkpoint it
> cares about.  It enables us to do all the I/O from userspace: no
> in-kernel sys_read/write().  I think this interface is much more
> flexible than a plain syscall.

Then get with Roland McGrath and build the next generation user
space debugging interface.

> Want to do a fast checkpoint?  Fine, copy all data, use a lot of memory,
> store it in-kernel.  Dump that out when the filesystem is accessed.
> Destroy it when userspace asks.

> So, why not?

Besides the part of creating a bunch of questionable interfaces
that we need to support forever.

Ultimately the question is how do you do checkpoint restore and I just
don't see that happening with a filesystem interface.  Way way way too many
dangerous syscalls that are only needed for one thing.

Checkpoint/Restore is an atomic operation, and filesystems suck at building
high-level atomic primitives.

Eric


* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]   ` <m163n9y7yb.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
@ 2008-10-31  0:09     ` Dave Hansen
  2008-10-31  3:12       ` Eric W. Biederman
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2008-10-31  0:09 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: containers

On Thu, 2008-10-30 at 16:33 -0700, Eric W. Biederman wrote:
> Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:
> > I hate the syscall.  It's a very un-Linux-y way of doing things.  There,
> > I said it.  Here's an alternative.  It still uses the syscall to
> > initiate things, but it uses debugfs to transport the data instead.
> > This is just a concept demonstration.  It doesn't actually work, and I
> > wouldn't be using debugfs in practice.
> 
> A syscall is a very linux-y way to do it.

Darn, I thought I'd be able to sneak that one by.

> If you called it a core dump instead of a checkpoint you have exactly the same set
> of issues.

I completely agree with you that there's a lot of common ground here
between coredumps and checkpoints.  I'm not aware of any applications
like, let's say Oracle, that use coredumps in the process of normal
execution.  Checkpoints must be more scalable and lower overhead than
coredumps are.

> Why we are doing vfs_write instead of file->f_op->write I don't understand.

That's an excellent question.  I assume you're asking because at least
the elf core dump code uses it, right?

> > System calls in Linux are fast.  Doing lots of them is not a problem.
> > If it becomes one, we can always export a condensed version of this
> > format next to the expanded one, kinda like ftrace does.  Atomicity with
> > this approach is also not a problem.  The system call in this approach
> > doesn't return until the checkpoint is completely written out.
> 
> Extra copies for something (memory) you want to transfer quickly
> and efficiently is a problem.

That's definitely true.  But, as I said, this approach isn't bound to
copying everything.  We have the flexibility to choose what we do.

> Reading the memory of another process is a problem, to the point
> that the /proc/<pid>/mem interface has been removed from the kernel.

Yes, this is certainly true.  All of the ptrace-related security issues
surely tell us something.  But, I'm not sure of your point here.  Are
you saying that using sys_checkpoint() to dump a process's pages is
inherently safer than an approach that uses a filesystem in order to do the
same?

> > Want to do a fast checkpoint?  Fine, copy all data, use a lot of memory,
> > store it in-kernel.  Dump that out when the filesystem is accessed.
> > Destroy it when userspace asks.
> 
> > So, why not?
> 
> Besides the part of creating a bunch of questionable interfaces
> that we need to support forever.
> 
> Ultimately the question is how do you do checkpoint restore and I just
> don't see that happening with a filesystem interface.  Way way way too many
> dangerous syscalls that are only needed for one thing.

I completely understand what you're saying here.  But, could you
distinguish how this differs from the current way that sys_checkpoint()
does it?  Surely, the checkpoint format is an ABI.  It is a complex ABI
with many, many constituent structures.  This is an ABI with many, many,
ways of reading simple data.  Seems like just slicing up the problem
differently to me.

-- Dave


* Re: [BIG RFC] Filesystem-based checkpoint
  2008-10-31  0:09     ` Dave Hansen
@ 2008-10-31  3:12       ` Eric W. Biederman
       [not found]         ` <m1k5bpwj8j.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Eric W. Biederman @ 2008-10-31  3:12 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers

Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

> On Thu, 2008-10-30 at 16:33 -0700, Eric W. Biederman wrote:

>> If you called it a core dump instead of a checkpoint you have exactly the same set
>> of issues.
>
> I completely agree with you that there's a lot of common ground here
> between coredumps and checkpoints.  I'm not aware of any applications
> like, let's say Oracle, that use coredumps in the process of normal
> execution.  Checkpoints must be more scalable and lower overhead than
> coredumps are.

Checkpoints certainly need to be as light weight as we can make them.

Checkpoints as backup of where you are in case the machine crashes
I'm not certain I believe in.  Checkpoints for saving state over
a kernel upgrade or for migrating to a different machine make a lot
of sense to me.

>> Why we are doing vfs_write instead of file->f_op->write I don't understand.
>
> That's an excellent question.  I assume you're asking because at least
> the elf core dump code uses it, right?

Yes.

>> > System calls in Linux are fast.  Doing lots of them is not a problem.
>> > If it becomes one, we can always export a condensed version of this
>> > format next to the expanded one, kinda like ftrace does.  Atomicity with
>> > this approach is also not a problem.  The system call in this approach
>> > doesn't return until the checkpoint is completely written out.
>> 
>> Extra copies for something (memory) you want to transfer quickly
>> and efficiently is a problem.
>
> That's definitely true.  But, as I said, this approach isn't bound to
> copying everything.  We have the flexibility to choose what we do.

With a file descriptor I can push the data onto a network socket and
the receiving process is on another computer.  0 copies, 0 trips
to user space.  I'm not certain how you would achieve that with the filesystem
approach.

>> Reading the memory of another process is a problem, to the point
>> that the /proc/<pid>/mem interface has been removed from the kernel.
>
> Yes, this is certainly true.  All of the ptrace-related security issues
> surely tell us something.  But, I'm not sure of your point here.  Are
> you saying that using sys_checkpoint() to dump a process's pages is
> inherently safer than an approach that uses a filesystem in order to do the
> same?

I'm saying inspecting another process is a very racy operation so something
we need to be especially careful with. 

>> > Want to do a fast checkpoint?  Fine, copy all data, use a lot of memory,
>> > store it in-kernel.  Dump that out when the filesystem is accessed.
>> > Destroy it when userspace asks.
>> 
>> > So, why not?
>> 
>> Besides the part of creating a bunch of questionable interfaces
>> that we need to support forever.
>> 
>> Ultimately the question is how do you do checkpoint restore and I just
>> don't see that happening with a filesystem interface.  Way way way too many
>> dangerous syscalls that are only needed for one thing.
>
> I completely understand what you're saying here.  But, could you
> distinguish how this differs from the current way that sys_checkpoint()
> does it?  Surely, the checkpoint format is an ABI.  It is a complex ABI
> with many, many constituent structures.  This is an ABI with many, many,
> ways of reading simple data.  Seems like just slicing up the problem
> differently to me.

I was thinking about restore.  Creating objects with a certain id can
easily be a security risk if you are not creating the namespace those
objects live in at the same time.  There is currently the downside
that we can't create namespaces as unprivileged users ( The
implementation of suid is so annoying). But the general concept still
applies, and if we ever get the uid namespace correct we will be able
to create namespaces as unprivileged users.

Eric


* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]         ` <m1k5bpwj8j.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
@ 2008-10-31 10:22           ` Louis Rilling
  2008-10-31 13:48           ` Serge E. Hallyn
  2008-10-31 14:21           ` Dave Hansen
  2 siblings, 0 replies; 26+ messages in thread
From: Louis Rilling @ 2008-10-31 10:22 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: containers, Dave Hansen



On Thu, Oct 30, 2008 at 08:12:28PM -0700, Eric W. Biederman wrote:
> Checkpoints as backup of where you are in case the machine crashes
> I'm not certain I believe in. 

This is actually the main reason why checkpoint has been used in HPC for years,
way before container-based implementations appeared.

Louis

-- 
Dr Louis Rilling			Kerlabs
Skype: louis.rilling			Batiment Germanium
Phone: (+33|0) 6 80 89 08 23		80 avenue des Buttes de Coesmes
http://www.kerlabs.com/			35700 Rennes



* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]         ` <m1k5bpwj8j.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  2008-10-31 10:22           ` Louis Rilling
@ 2008-10-31 13:48           ` Serge E. Hallyn
  2008-10-31 14:21           ` Dave Hansen
  2 siblings, 0 replies; 26+ messages in thread
From: Serge E. Hallyn @ 2008-10-31 13:48 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: containers, Dave Hansen

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> With a file descriptor I can push the data onto a network socket and
> the receiving process is on another computer.  0 copies, 0 trips
> to user space.  I'm not certain how you would achieve that with the filesystem
> approach.

This has been Oren's most convincing argument for all sorts of little
choices (his precise data format, the use of an fd and cr_kwrite()).

I wonder (a) what neat things Dave could come up with to bridge that
gap, and (b) how much of that gap becomes less meaningful with a proper
use of pre-dump (and post-dump).

> >> Reading the memory of another process is a problem, to the point
> >> that the /proc/<pid>/mem interface has been removed from the kernel.
> >
> > Yes, this is certainly true.  All of the ptrace-related security issues
> > surely tell us something.  But, I'm not sure of your point here.  Are
> > you saying that using sys_checkpoint() to dump a process's pages is
> > inherently safer than an approach that uses a filesystem in order to do the
> > same?
> 
> I'm saying inspecting another process is a very racy operation so something
> we need to be especially careful with. 

I don't see any difference there between Dave's and Oren's approaches.
In either case, the container is frozen while the kernel walks the
container's task's pages and dumps them... somewhere.

-serge


* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]         ` <m1k5bpwj8j.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  2008-10-31 10:22           ` Louis Rilling
  2008-10-31 13:48           ` Serge E. Hallyn
@ 2008-10-31 14:21           ` Dave Hansen
  2008-10-31 20:51             ` Eric W. Biederman
  2 siblings, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2008-10-31 14:21 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: containers

On Thu, 2008-10-30 at 20:12 -0700, Eric W. Biederman wrote:
> Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:
> >> > System calls in Linux are fast.  Doing lots of them is not a problem.
> >> > If it becomes one, we can always export a condensed version of this
> >> > format next to the expanded one, kinda like ftrace does.  Atomicity with
> >> > this approach is also not a problem.  The system call in this approach
> >> > doesn't return until the checkpoint is completely written out.
> >> 
> >> Extra copies for something (memory) you want to transfer quickly
> >> and efficiently is a problem.
> >
> > That's definitely true.  But, as I said, this approach isn't bound to
> > copying everything.  We have the flexibility to choose what we do.
> 
> With a file descriptor I can push the data onto a network socket and
> the receiving process is on another computer.  0 copies, 0 trips
> to user space.  I'm not certain how you would achieve that with the filesystem
> approach.

sys_checkpoint() does:
	1. copy from task_struct (or whatever kernel struct) into buffer
	2. run vfs_write() with that buffer and the user fd
	3. fd target reads from that buffer

The fs approach would:
	1. user calls read()
	2. fs fills data directly into *userspace* buffer
	3. user does sendfile, etc...

See?  sys_checkpoint() *does* a copy.  It just does it into a kernel
buffer.  That's why we need to call vfs_write().
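
A minimal userspace sketch of the filesystem-side steps above, assuming the
checkpoint data is exposed through an ordinary readable file descriptor.  The
helper name drain_checkpoint() is hypothetical and not taken from any posted
patch, and plain read()/write() stands in for step 3, where sendfile() or
similar could be substituted:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy everything readable from in_fd (assumed to be
 * a checkpoint file) to out_fd (a file, pipe, or socket).
 * Returns the total number of bytes copied, or -1 on error.
 */
ssize_t drain_checkpoint(int in_fd, int out_fd)
{
	char buf[4096];
	ssize_t total = 0;

	assert(in_fd >= 0 && out_fd >= 0);

	for (;;) {
		/* Steps 1 and 2: the kernel fills our userspace buffer. */
		ssize_t n = read(in_fd, buf, sizeof(buf));
		if (n == 0)
			break;			/* end of checkpoint data */
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry the read */
			return -1;
		}

		/* Step 3: forward the data, handling short writes. */
		ssize_t off = 0;
		while (off < n) {
			ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		total += n;
	}
	return total;
}
```

Note that this loop still copies each chunk through a userspace buffer, which
is the cost being weighed against the in-kernel buffer on the
sys_checkpoint() side.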

> I'm saying inspecting another process is a very racy operation so something
> we need to be especially careful with. 

No disagreement from me on that one.  

> >> Ultimately the question is how do you do checkpoint restore and I just
> >> don't see that happening with a filesystem interface.  Way way way too many
> >> dangerous syscalls that are only needed for one thing.
> >
> > I completely understand what you're saying here.  But, could you
> > distinguish how this differs from the current way that sys_checkpoint()
> > does it?  Surely, the checkpoint format is an ABI.  It is a complex ABI
> > with many, many constituent structures.  This is an ABI with many, many,
> > ways of reading simple data.  Seems like just slicing up the problem
> > differently to me.
> 
> I was thinking about restore.  Creating objects with a certain id can
> easily be a security risk if you are not creating the namespace those
> objects live in at the same time.  There is currently the downside
> that we can't create namespaces as unprivileged users ( The
> implementation of suid is so annoying). But the general concept still
> applies, and if we ever get the uid namespace correct we will be able
> to create namespaces as unprivileged users.

Eric, you were saying that my interface had way too many "dangerous
syscalls".  How does this relate to user namespaces and creating objects
with particular ids?  Surely if the true problem with my suggested
approach has to do with creating empty namespaces, the same problem
exists with the sys_checkpoint() approach.

-- Dave


* Re: [BIG RFC] Filesystem-based checkpoint
  2008-10-31 14:21           ` Dave Hansen
@ 2008-10-31 20:51             ` Eric W. Biederman
       [not found]               ` <m1r65wpjx2.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  0 siblings, 1 reply; 26+ messages in thread
From: Eric W. Biederman @ 2008-10-31 20:51 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers

Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:

>> I was thinking about restore.  Creating objects with a certain id can
>> easily be a security risk if you are not creating the namespace those
>> objects live in at the same time.  There is currently the downside
>> that we can't create namespaces as unprivileged users ( The
>> implementation of suid is so annoying). But the general concept still
>> applies, and if we ever get the uid namespace correct we will be able
>> to create namespaces as unprivileged users.
>
> Eric, you were saying that my interface had way too many "dangerous
> syscalls".  How does this relate to user namespaces and creating objects
> with particular ids?  Surely if the true problem with my suggested
> approach has to do with creating empty namespaces, the same problem
> exists with the sys_checkpoint() approach.

Ok. Some concrete examples to put this in context.

First the class of problem I am talking about is the classic unix temp file
security hole.

A specific example is fork_and_set_child_pid();

Suppose there is an important system daemon whose pid is 23.
It dies and doesn't delete its pid file.
A malicious user notices this and does fork_and_set_child_pid(23);
Later someone checks to see if the important system daemon is running,
sees a process at pid 23, and so does not restart it.
A DOS attack.
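
To make the scenario concrete, here is a sketch of the kind of liveness check
a monitoring script might perform.  The function name pid_looks_alive() and
the pid-file handling are assumptions made for illustration; the check only
asks whether *some* process holds the recorded pid, which is why a process
created with a chosen pid would satisfy it:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical check: return 1 if the pid recorded in pidfile refers to
 * an existing process, 0 otherwise.  It cannot tell whether that process
 * is really the daemon that wrote the file.
 */
int pid_looks_alive(const char *pidfile)
{
	FILE *f;
	long pid = 0;
	int parsed;

	assert(pidfile != NULL);

	f = fopen(pidfile, "r");
	if (!f)
		return 0;		/* no pid file: treat as not running */
	parsed = fscanf(f, "%ld", &pid);
	fclose(f);
	if (parsed != 1 || pid <= 0)
		return 0;		/* unreadable or nonsensical contents */

	/* Signal 0 performs the existence check without delivering anything. */
	if (kill((pid_t)pid, 0) == 0)
		return 1;
	/* EPERM means the process exists but belongs to someone else. */
	return errno == EPERM;
}
```

Nothing in this check ties the pid back to the original daemon, so any
process that ends up with the stale pid passes it.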

In a sys_restore() scenario at the very start you can check to make
certain that the reference count for the namespaces is 1 and that they
are empty.  Which means there is no chance of confusing user space.

With fork_and_set_child_pid() what is a simple cheap one time check
becomes an expensive painful one, if you can even implement it at all.

The difference is that with a bunch of small pieces you lose atomicity.

Eric


* Re: [BIG RFC] Filesystem-based checkpoint
       [not found]               ` <m1r65wpjx2.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
@ 2008-11-03 17:23                 ` Dave Hansen
  2008-11-03 17:48                   ` Dave Hansen
  0 siblings, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2008-11-03 17:23 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: containers

On Fri, 2008-10-31 at 13:51 -0700, Eric W. Biederman wrote:
> Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> writes:
> > Eric, you were saying that my interface had way too many "dangerous
> > syscalls".  How does this relate to user namespaces and creating objects
> > with particular ids?  Surely if the true problem with my suggested
> > approach has to do with creating empty namespaces, the same problem
> > exists with the sys_checkpoint() approach.
...
> In a sys_restore() scenario at the very start you can check to make
> certain that the reference count for the namespaces is 1 and that they
> are empty.  Which means there is no chance of confusing user space.
> 
> With fork_and_set_child_pid() what is a simple cheap one time check
> becomes an expensive painful one, if you can even implement it at all.
> 
> The difference is that with a bunch of small pieces you lose atomicity.

I think we're just trading trade-offs here. :)

I believe your suggestion is simply to constrain the problem.  If we put
extra restrictions on sys_restart() to ensure that its job is simpler
then some of the implementation problems just go away.  That's
definitely a good approach.

In this case you are saying that, during a call to sys_restart(), we
should ensure that the task doing the restoring holds the only reference
to those namespaces.  If it does, that means that there can't possibly
be any security implications because no one else can possibly even *see*
those namespaces.  This is a laudable goal, but I'm not sure it works in
practice without more code.

The problem is that we can't possibly use refcounts (at least the ones
we have today) alone.  For instance, with the pid namespace, we would
have 1 ref for the 'init' process doing the sys_restore() call and then
a possible second refcount for /proc.  Perhaps we could differentiate
references to namespaces that instantiate objects inside the namespaces
from purely references to the namespace *itself*.

Rather than offering a solution for the filesystem-based approach, I'll
venture this: whatever I come up with will be extra code to glue things
back together, to detect when namespaces are "fresh" and able to be
scribbled into.  

Anyway, it's obvious that you and Oren don't like my approach just as
much as I don't like the syscalls.  So, I'll just drop it for now.  But,
please do keep it in the back of your minds in case it applies
somewhere.

-- Dave


* Re: [BIG RFC] Filesystem-based checkpoint
  2008-11-03 17:23                 ` Dave Hansen
@ 2008-11-03 17:48                   ` Dave Hansen
  0 siblings, 0 replies; 26+ messages in thread
From: Dave Hansen @ 2008-11-03 17:48 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: containers

On Mon, 2008-11-03 at 09:23 -0800, Dave Hansen wrote:
> The problem is that we can't possibly use refcounts (at least the ones
> we have today) alone.  For instance, with the pid namespace, we would
> have 1 ref for the 'init' process doing the sys_restore() call and then
> a possible second refcount for /proc.  Perhaps we could differentiate
> references to namespaces that instantiate objects inside the namespaces
> from purely references to the namespace *itself*.

By this, I mean something along the lines of 'mm_struct's mm_users vs.
mm_count.
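
As a rough illustration of that two-counter idea, here is a toy userspace
model.  The struct toy_ns and its helpers are hypothetical names that do not
exist in the kernel, and the model only loosely mirrors the mm_users vs.
mm_count split; it shows how a "fresh" test could ignore references that
merely pin the structure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy namespace with two counters (hypothetical, for illustration only). */
struct toy_ns {
	int count;	/* references to the structure itself, e.g. a /proc mount */
	int users;	/* objects living inside the namespace, e.g. tasks */
};

/* Pin the structure without becoming a member of the namespace. */
void toy_ns_get(struct toy_ns *ns)
{
	ns->count++;
}

/* Instantiate another object (task) inside the namespace. */
void toy_ns_add_user(struct toy_ns *ns)
{
	ns->users++;
}

/*
 * "Fresh" means the only member is the task doing the restore.
 * Extra structure references do not affect the answer.
 */
bool toy_ns_is_fresh(const struct toy_ns *ns)
{
	assert(ns != NULL);
	return ns->users == 1;
}
```

With a single counter, the /proc reference alone would make the namespace
look shared; with two, only objects created inside it do.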

-- Dave


* Re: [BIG RFC] Filesystem-based checkpoint
  2008-10-30 20:11             ` Dave Hansen
@ 2008-11-04 21:33               ` Mike Waychison
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Waychison @ 2008-11-04 21:33 UTC (permalink / raw)
  To: Dave Hansen; +Cc: containers

Dave Hansen wrote:
> On Thu, 2008-10-30 at 15:47 -0400, Oren Laadan wrote:
>>   3. Your approach doesn't play well with what I call "checkpoint that
>>   involves self". This term refers to a process that checkpoints itself
>>   (and only itself), or to a process that attempts to checkpoint its own
>>   container.  In both cases, there is no other entity that will read the
>>   data from the file system while the caller is blocked.
>>
>> This is a key point for me, with multiple use cases. The simplest, if
>> you will, is for a process to simply checkpoint itself (no containers
>> and other crap :p). Same for dumping your own container. And there are
>> others.
> 
> Let's take a step back here.  I believe that strictly enforcing this
> requirement strictly requires that the checkpoint be done in its
> entirety by the kernel.
> 
> A process must have its state serialized in a repeatable way.  That
> basically precludes  it running during the checkpoint, or having its
> state change in any way that isn't atomic.
> 
> If a process can't be, itself, running during a checkpoint, then
> something running must be performing the checkpoint.  That "something"
> must either be another process or the kernel.  Since you've defined the
> goal as a self-checkpoint, it *can't* be another process.  So, it *must*
> be the kernel.

Stepping in a little late into the conversation here, but a 
self-checkpoint to me means "initiated by self".  It doesn't preclude a 
userland service (outside our container or whatever) performing the 
grunge work for us once asked.

> 
> When it comes down to it, I think this point drives quite a bit of the
> implementation.  The cr_kread/write(), for instance.  We *need* the
> kernel to do the writing since we've completely precluded userspace from
> doing it.
> 
> -- Dave


Thread overview: 26+ messages
2008-10-28 18:37 [BIG RFC] Filesystem-based checkpoint Dave Hansen
2008-10-28 20:56 ` Serge E. Hallyn
     [not found]   ` <20081028205654.GA17487-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-10-28 21:00     ` Dave Hansen
2008-10-28 21:10     ` Dave Hansen
2008-10-30 16:25       ` Oren Laadan
     [not found]         ` <4909E000.9070201-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-10-30 16:36           ` Dave Hansen
2008-10-30 18:19 ` Oren Laadan
     [not found]   ` <4909FAA8.5000107-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-10-30 19:28     ` Serge E. Hallyn
     [not found]       ` <20081030192817.GA16340-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-10-30 19:39         ` Dave Hansen
2008-10-30 19:50           ` Serge E. Hallyn
2008-10-30 19:47         ` Oren Laadan
     [not found]           ` <490A0F67.5000303-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-10-30 20:03             ` Serge E. Hallyn
2008-10-30 20:11             ` Dave Hansen
2008-11-04 21:33               ` Mike Waychison
2008-10-30 19:37     ` Dave Hansen
2008-10-30 20:15       ` Oren Laadan
     [not found]         ` <490A15F5.6010702-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-10-30 20:40           ` Dave Hansen
2008-10-30 23:33 ` Eric W. Biederman
     [not found]   ` <m163n9y7yb.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-10-31  0:09     ` Dave Hansen
2008-10-31  3:12       ` Eric W. Biederman
     [not found]         ` <m1k5bpwj8j.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-10-31 10:22           ` Louis Rilling
2008-10-31 13:48           ` Serge E. Hallyn
2008-10-31 14:21           ` Dave Hansen
2008-10-31 20:51             ` Eric W. Biederman
     [not found]               ` <m1r65wpjx2.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-11-03 17:23                 ` Dave Hansen
2008-11-03 17:48                   ` Dave Hansen
