From: Oren Laadan <orenl@cs.columbia.edu>
To: Dave Hansen <dave@sr71.net>
Cc: containers@lists.linux-foundation.org,
Jeremy Fitzhardinge <jeremy@goop.org>,
linux-kernel@vger.kernel.org, arnd@arndb.de
Subject: Re: [RFC v2][PATCH 2/9] General infrastructure for checkpoint restart
Date: Sun, 24 Aug 2008 22:47:24 -0400 [thread overview]
Message-ID: <48B21D3C.8090601@cs.columbia.edu> (raw)
In-Reply-To: <1219435302.20559.121.camel@nimitz>
Dave Hansen wrote:
> On Wed, 2008-08-20 at 23:04 -0400, Oren Laadan wrote:
>> diff --git a/checkpoint/checkpoint.c b/checkpoint/checkpoint.c
>> new file mode 100644
>> index 0000000..25343f5
>> --- /dev/null
>> +++ b/checkpoint/checkpoint.c
[...]
>> +/* write the checkpoint header */
>> +static int cr_write_hdr(struct cr_ctx *ctx)
>> +{
>> + struct cr_hdr h;
>> + struct cr_hdr_head *hh = ctx->hbuf;
>> + struct new_utsname *uts;
>> + struct timeval ktv;
>> +
>> + h.type = CR_HDR_HEAD;
>> + h.len = sizeof(*hh);
>> + h.ptag = 0;
>> +
>> + do_gettimeofday(&ktv);
>> +
>> + hh->magic = CR_MAGIC_HEAD;
>> + hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
>> + hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
>> + hh->patch = (LINUX_VERSION_CODE) & 0xff;
>
> We at least neex EXTRAVERSION in there if we're going to even pretend
> this is useful, right? is this redundant with utsname() below?
It isn't clear exactly what we need in the header in terms of version, config,
and compile information. I put this as a starter, because it's obvious and
because this info is likely to be useful even at the user/admin level. I don't
expect this to remain as is as we continue development.
[...]
>> +struct cr_ctx {
>> + pid_t pid; /* container identifier */
>> + int crid; /* unique checkpoint id */
>> +
>> + unsigned long flags;
>> + unsigned long oflags; /* restart: old flags */
>> +
>> + struct file *file;
>> + int total; /* total read/written */
>> +
>> + void *tbuf; /* temp: to avoid many alloc/dealloc */
>> + void *hbuf; /* header: to avoid many alloc/dealloc */
>> + int hpos;
>
> alloc/delloc is not a bad thing. :) I think at least a few of these
> should be moved over to stack or kmalloc()/kfree().
Definitely the comment is misleading. See comment below.
>
>> + struct path *vfsroot; /* container root */
>> +};
>
> Can a container only have one vfsroot?
This is a temporary solution since this patchset doesn't really handle real
containers yet. Will add a FIXME comment.
>
>> +
>> +/* cr_ctx: flags */
>> +#define CR_CTX_CKPT 0x1
>> +#define CR_CTX_RSTR 0x2
>
> enum?
These are flags (more will come later).
[...]
>> +struct cr_hdr_task {
>> + __u64 state;
>> + __u32 exit_state;
>> + __u32 exit_code, exit_signal;
>> +
>> + __u64 utime, stime, utimescaled, stimescaled;
>> + __u64 gtime;
>> + __u64 prev_utime, prev_stime;
>> + __u64 nvcsw, nivcsw;
>> + __u64 start_time_sec, start_time_nsec;
>> + __u64 real_start_time_sec, real_start_time_nsec;
>> + __u64 min_flt, maj_flt;
>> +
>> + __s16 task_comm_len;
>> + char comm[TASK_COMM_LEN];
>> +} __attribute__ ((aligned (8)));
>
> TASK_COMM_LEN might be subject to changing. Can it be part of the
> user<->kernel ABI?
Good point. Having the 'task_comm_len' field there isn't enough because the
size of the entire struct changes anyways.
[...]
>
> ...
>> +/*
>> + * During restart the code reads in data from the chekcpoint image into a
>> + * temporary buffer (ctx->hbuf). Because operations can be nested, one
>> + * should call cr_hbuf_get() to reserve space in the buffer, and then
>> + * cr_hbuf_put() when it no longer needs that space
>> + */
>> +
>> +#include <linux/version.h>
>> +#include <linux/sched.h>
>> +#include <linux/file.h>
>> +
>> +#include "ckpt.h"
>> +#include "ckpt_hdr.h"
>> +
>> +/**
>> + * cr_hbuf_get - reserve space on the hbuf
>> + * @ctx: checkpoint context
>> + * @n: number of bytes to reserve
>> + */
>> +void *cr_hbuf_get(struct cr_ctx *ctx, int n)
>> +{
>> + void *ptr;
>> +
>> + BUG_ON(ctx->hpos + n > CR_HBUF_TOTAL);
>> + ptr = (void *) (((char *) ctx->hbuf) + ctx->hpos);
>> + ctx->hpos += n;
>> + return ptr;
>> +}
>> +
>> +/**
>> + * cr_hbuf_put - unreserve space on the hbuf
>> + * @ctx: checkpoint context
>> + * @n: number of bytes to reserve
>> + */
>> +void cr_hbuf_put(struct cr_ctx *ctx, int n)
>> +{
>> + BUG_ON(ctx->hpos < n);
>> + ctx->hpos -= n;
>> +}
>
> I replaced all of the uses of these with kmalloc()/kfree() and stack
> allocations. I'm really not sure what these buy us since our allocators
> are already so fast. tbuf, especially, worries me if it gets used in
> any kind of nested manner we're going to get some really fun bugs.
I disagree with putting some of the cr_hdr_... on the stack: first, if they
grow in size, it's easy to forget to change the allocation to stack. Second,
using a standard code/handling for all cr_hdr_... makes the code more readable
and easier for others to extend by simply following existing code.
The main argument for ->hbuf is that eventually we would like to buffer
data in the kernel to avoid potentially slow writing/reading when processes
are frozen during checkpoint/restart. By using the simple cr_get_hbuf() and
cr_put_hbuf() we prepare for that: now they just allocate room in ->hbuf,
but later they will provide a pointer directly in the data buffer. Moreover,
it simplifies the error path since cleanup (of ->hbuf) is implicit. Also,
it is designed to allow checkpoint and restart function to be called in a
nested manner, again simplifying the code. Finally, my experience was that
it can impact performance if you are after very short downtimes, especially
for the checkpoint.
The use of ->tbuf is less critical and is mainly for performance and to
simplify the error path, never to be called nested. I'm ok with removing
it for now.
[...]
>> +/* read the checkpoint header */
>> +static int cr_read_hdr(struct cr_ctx *ctx)
>> +{
>> + struct cr_hdr_head *hh = cr_hbuf_get(ctx, sizeof(*hh));
>> + int ret;
>> +
>> + ret = cr_read_obj_type(ctx, hh, sizeof(*hh), CR_HDR_HEAD);
>> + if (ret < 0)
>> + return ret;
>> + else if (ret != 0)
>> + return -EINVAL;
>
> How about just making cr_read_obj_type() stop returning positive values?
> Is it ever valid for it to return >0?
The idea with cr_read_obj_type() is that it returns the 'ptag' - parent tag -
field of the object that it reads in. The 'ptag' is the tag of the parent
object, and is useful in several places. For instance, the 'ptag' of an MM
header identifies (is equal to) the 'tag' of TASK to which it belongs. In
this case, the 'ptag' should be zero because it has no parent object.
I'll change the variable name from 'ret' to 'ptag' to make it more obvious.
[...]
>
> -- Dave
>
Oren.
next prev parent reply other threads:[~2008-08-25 2:47 UTC|newest]
Thread overview: 54+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-08-21 2:58 [RFC v2][PATCH 1/9] kernel based checkpoint-restart Oren Laadan
2008-08-21 3:03 ` [RFC v2][PATCH 1/9] Create trivial sys_checkpoint/sys_restart syscalls Oren Laadan
2008-08-21 5:17 ` Oren Laadan
2008-08-22 19:32 ` Dave Hansen
2008-08-22 20:11 ` Dave Hansen
2008-08-22 21:20 ` Oren Laadan
2008-08-21 3:04 ` [RFC v2][PATCH 2/9] General infrastructure for checkpoint restart Oren Laadan
2008-08-21 9:35 ` Louis Rilling
2008-08-24 5:58 ` Oren Laadan
2008-08-22 20:01 ` Dave Hansen
2008-08-25 2:47 ` Oren Laadan [this message]
2008-08-26 16:42 ` Dave Hansen
2008-08-27 0:38 ` Oren Laadan
2008-08-21 3:04 ` [RFC v2][PATCH 3/9] x86 support for checkpoint/restart Oren Laadan
2008-08-22 20:17 ` Dave Hansen
2008-08-21 3:05 ` [RFC v2][PATCH 4/9] Memory management - dump state Oren Laadan
2008-08-21 7:30 ` Ingo Molnar
2008-08-21 8:01 ` Justin P. Mattock
2008-08-21 10:28 ` Balbir Singh
2008-08-21 11:59 ` Ingo Molnar
2008-08-21 9:53 ` Louis Rilling
2008-08-22 21:21 ` Oren Laadan
2008-08-22 20:37 ` Dave Hansen
2008-08-24 5:40 ` Oren Laadan
2008-08-26 16:33 ` Dave Hansen
2008-08-27 0:14 ` Oren Laadan
2008-08-27 15:41 ` Dave Hansen
2008-08-27 15:57 ` Louis Rilling
2008-08-27 16:12 ` Dave Hansen
2008-08-27 16:19 ` Jeremy Fitzhardinge
2008-08-27 20:34 ` Serge E. Hallyn
2008-08-27 20:38 ` Dave Hansen
2008-08-27 20:48 ` Serge E. Hallyn
2008-08-27 20:56 ` Dave Hansen
2008-08-31 7:16 ` Oren Laadan
2008-08-31 17:34 ` Cedric Le Goater
2008-09-03 11:43 ` [Devel] " Andrey Mirkin
2008-09-03 12:15 ` Cedric Le Goater
2008-09-03 13:29 ` Andrey Mirkin
2008-09-02 15:32 ` Serge E. Hallyn
2008-08-21 3:05 ` [RFC v2][PATCH 5/9] Memory managemnet - restore state Oren Laadan
2008-08-21 10:07 ` Louis Rilling
2008-08-21 3:06 ` [RFC v2][PATCH 6/9] Checkpoint/restart: initial documentation Oren Laadan
2008-08-21 3:06 ` [RFC v2][PATCH 7/9] Infrastructure for shared objects Oren Laadan
2008-08-21 10:40 ` Louis Rilling
2008-08-26 17:01 ` Dave Hansen
2008-08-27 8:26 ` Louis Rilling
2008-08-21 3:07 ` [RFC v2][PATCH 8/9] File descriprtors - dump state Oren Laadan
2008-08-21 11:06 ` Louis Rilling
2008-08-25 3:28 ` Oren Laadan
2008-08-25 10:30 ` Louis Rilling
2008-08-21 3:07 ` [RFC v2][PATCH 9/9] File descriprtors (restore) Oren Laadan
2008-08-21 5:26 ` Oren Laadan
2008-08-21 5:15 ` [RFC v2][PATCH 1/9] kernel based checkpoint-restart Oren Laadan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=48B21D3C.8090601@cs.columbia.edu \
--to=orenl@cs.columbia.edu \
--cc=arnd@arndb.de \
--cc=containers@lists.linux-foundation.org \
--cc=dave@sr71.net \
--cc=jeremy@goop.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox