From: Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
To: Louis.Rilling-aw0BnHfMbSpBDgjK7y7TUQ@public.gmane.org
Cc: Linux Containers
	<containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
Subject: Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps
Date: Thu, 31 Jul 2008 11:09:54 -0400
Message-ID: <4891D5C2.8090000@cs.columbia.edu>
In-Reply-To: <20080731135703.GC22403-Hu8+6S1rdjywhHL9vcZdMVaTQe2KTcn/@public.gmane.org>



Louis Rilling wrote:
> On Wed, Jul 30, 2008 at 06:20:32PM -0400, Oren Laadan wrote:
>>
>> Serge E. Hallyn wrote:
>>> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
>>>> +int do_checkpoint(struct cr_ctx *ctx)
>>>> +{
>>>> +	int ret;
>>>> +
>>>> +	/* FIX: need to test whether container is checkpointable */
>>>> +
>>>> +	ret = cr_write_hdr(ctx);
>>>> +	if (!ret)
>>>> +		ret = cr_write_task(ctx, current);
>>>> +	if (!ret)
>>>> +		ret = cr_write_tail(ctx);
>>>> +
>>>> +	/* on success, return (unique) checkpoint identifier */
>>>> +	if (!ret)
>>>> +		ret = ctx->crid;
>>> Does this crid have a purpose?
>> yes, at least three; all are for the future, but it is important to set
>> the meaning of the syscall's return value already now. The "crid" is
>> the CR-identifier that identifies the checkpoint. Every checkpoint is
>> assigned a unique number (using an atomic counter).
>>
>> 1) if a checkpoint is taken and kept in memory (instead of to a file) then
>> this will be the identifier with which the restart (or cleanup) would refer
>> to the (in memory) checkpoint image
>>
>> 2) to reduce downtime of the checkpoint, data will be aggregated on the
>> checkpoint context, as well as references to (cow-ed) pages. This data can
>> persist between calls to sys_checkpoint(), and the 'crid', again, will be
>> used to identify the (in-memory-to-be-dumped-to-storage) context.
>>
>> 3) for incremental checkpoint (where a successive checkpoint will only
>> save what has changed since the previous checkpoint) there will be a need
>> to identify the previous checkpoints (to be able to know where to take
>> data from during restart). Again, a 'crid' is handy.
>>
>> [in fact, for the 3rd use, it will make sense to write that number as
>> part of the checkpoint image header]
>>
>> Note that by doing so, a process that checkpoints itself (in its own
>> context), can use code that is similar to the logic of fork():
>>
>> 	...
>> 	crid = checkpoint(...);
>> 	switch (crid) {
>> 	case -1:
>> 		perror("checkpoint failed");
>> 		break;
>> 	default:
>> 		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
>> 		/* proceed with execution after checkpoint */
>> 		...
>> 		break;
>> 	case 0:
>> 		fprintf(stderr, "returned after restart\n");
>> 		/* proceed with action required following a restart */
>> 		...
>> 		break;
>> 	}
>> 	...
> 
> If I understand correctly, this crid can live for quite a long time. So many of
> them could be generated while some container would accumulate incremental
> checkpoints on, say crid 5, and possibly crid 5 could be reused for another
> unrelated checkpoint during that time. This brings the issue of allocating crids
> reliably (using something like a pidmap for instance). Moreover, if such ids are
> exposed to userspace, we need to remember which ones are allocated across
> reboots and migrations.
> 
> I'm afraid that this becomes too complex...

And I'm afraid I didn't explain myself well. So let me rephrase:

CRIDs are always _local_ to a specific node. The local CRID counter is
bumped (atomically) with each checkpoint attempt. The main use case is
for when the checkpoint is kept in memory, either briefly (until it is
written back to disk) or for a longer time (use-cases that want to keep
it there). A CRID only remains valid as long as the checkpoint image is
still in memory and has not been committed to storage/network. Think
of it as a way to identify the operation instance.

So they can live quite a long time, but only as long as the original
node is still alive and the checkpoint is still kept in memory. They
are meaningless across reboots and migrations. I don't think a wrap
around is a concern, but we can use 64 bit if that is the case.
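
A minimal sketch of such a node-local counter, written as userspace C11
rather than kernel code. The name cr_alloc_crid() and the choice to start
at 1 (so that 0 stays free for the "returned after restart" case) are
assumptions made for illustration; they are not part of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical node-local counter. 64 bits wide, so wrap-around is not a
 * practical concern. Zero-initialized because it has static storage. */
static _Atomic uint64_t crid_counter;

/* Hypothetical helper: bump the counter atomically and return the new value.
 * The first caller gets 1; concurrent callers never see the same value. */
uint64_t cr_alloc_crid(void)
{
	uint64_t id = atomic_fetch_add(&crid_counter, 1) + 1;

	/* 0 is assumed to be reserved for "returned after restart". */
	assert(id != 0);
	return id;
}
```

Each checkpoint attempt would call the helper once, whether or not the
checkpoint later succeeds, which matches "bumped with each attempt".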

Finally, the incremental checkpoint use-case: imagine a container that
is checkpointed regularly every minute. The first checkpoint will be
a full checkpoint, say CRID=1. The second will be incremental with
respect to the first, with CRID=2, and so on for the third and the fourth.
Userspace could use these CRIDs to name the image files (for example,
app.img.CRID). Assume that we decide (big "if") that the convention is
that the last part of the filename must be the CRID, and that we also decide
(another big "if") to save the CRID as part of the checkpoint image --
the part that describes the "incremental nature" of a new checkpoint.
(That part would specify where to get state that wasn't really saved
in the new checkpoint but instead can be retrieved from older ones).
If that were the case, then it would be fairly simple for the kernel
to find (and access) the actual files that hold the data. Note that
in this case the CRIDs are guaranteed to be unique per series of
incremental checkpoints, and incremental checkpoint is meaningless
across reboots (and we can require that across migration too).
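
To make the naming convention concrete, here is a small userspace sketch
that builds an "app.img.CRID" style name and recovers the CRID from the
last component of a filename. The helper names cr_image_name() and
cr_image_crid() are hypothetical, and the convention itself is still one
of the big "if"s above.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "<base>.<crid>" into buf. Returns 0 on success, -1 if buf is too small. */
int cr_image_name(char *buf, size_t len, const char *base, unsigned long crid)
{
	int n;

	assert(buf != NULL && base != NULL);
	n = snprintf(buf, len, "%s.%lu", base, crid);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Parse the CRID from the last dot-separated component of name.
 * Returns 0 on success, -1 if the component is missing or not a number. */
int cr_image_crid(const char *name, unsigned long *crid)
{
	const char *dot = strrchr(name, '.');
	char *end;

	if (!dot || dot[1] < '0' || dot[1] > '9')
		return -1;	/* also rejects the sign/whitespace strtoul would accept */
	errno = 0;
	*crid = strtoul(dot + 1, &end, 10);
	if (errno || *end != '\0')
		return -1;
	return 0;
}
```

With this, cr_image_name(buf, sizeof(buf), "app.img", 2) yields
"app.img.2", and cr_image_crid() on that name gives back 2.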

We probably don't want to use something like a pid to identify the
checkpoint (while in memory), because we may have multiple checkpoints
in memory at a time (of the same container).

> 
> It would be way easier if the only (kernel-level) references to a checkpoint
> were pointers to its context. Ideally, the only reference would live in a
> 'struct container' and would be easily updated at restart-time.

Consider the following scenario of calls from user-space (which is
how I envision the checkpoint optimized for minimal downtime, in the
future):

1)	while (syscall_to_do_precopy)		<- do precopy until ready to
		if (too_long_already)		<- checkpoint or too long
			break;

2)	freeze_container();

3)	crid = checkpoint(.., .., CR_CKPT_LAZY);	<- checkpoint container
							<- don't commit to disk
							<- (minimize downtime)

4)	unfreeze_container();			<- now can unfreeze container
						<- as soon as possible

5)	ckpt_writeback(crid, fd);		<- container is back running. we
						<- can commit data to storage or
						<- network in the background.

#2 and #4 are done with freezer_cgroup()

#1, #3 and #5 must be syscalls

More specifically, syscall #5 must be able to refer to the result of syscall #3
(that is, the CRID!). It is possible that another syscall #3 occurs, on the same
container, between steps 4 and 5 ... but then that checkpoint will be assigned
another, unique CRID.
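
The relationship between steps 3 and 5 can be sketched in userspace as a
table of in-memory checkpoint images looked up by CRID. Everything here is
a stand-in made up for illustration: struct cr_image, cr_checkpoint_lazy()
and cr_writeback() are hypothetical names, the "image" is just a short
string, and the counter is not atomic because the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CR_MAX_IMAGES 8

/* Hypothetical in-memory checkpoint image, identified by its CRID. */
struct cr_image {
	long crid;		/* 0 means the slot is free */
	int container;		/* hypothetical container identifier */
	char data[32];		/* stand-in for the saved state */
};

static struct cr_image cr_table[CR_MAX_IMAGES];
static long cr_next_crid = 1;	/* a kernel would bump this atomically */

/* Stand-in for step 3: keep the image in memory and return its CRID. */
long cr_checkpoint_lazy(int container, const char *state)
{
	for (size_t i = 0; i < CR_MAX_IMAGES; i++) {
		struct cr_image *img = &cr_table[i];

		if (img->crid != 0)
			continue;
		img->crid = cr_next_crid++;
		img->container = container;
		strncpy(img->data, state, sizeof(img->data) - 1);
		img->data[sizeof(img->data) - 1] = '\0';
		return img->crid;
	}
	return -1;	/* no free slot */
}

/* Stand-in for step 5: copy the image out, then drop it from memory.
 * After this the CRID no longer refers to anything. */
int cr_writeback(long crid, char *out, size_t len)
{
	assert(out != NULL);
	if (crid <= 0 || len == 0)
		return -1;
	for (size_t i = 0; i < CR_MAX_IMAGES; i++) {
		if (cr_table[i].crid != crid)
			continue;
		strncpy(out, cr_table[i].data, len - 1);
		out[len - 1] = '\0';
		cr_table[i].crid = 0;
		return 0;
	}
	return -1;	/* unknown or already committed CRID */
}
```

Two lazy checkpoints of the same container get two different CRIDs, so a
later writeback can name exactly one of them; a second writeback with the
same CRID fails because the image is no longer in memory.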

> My $0.02 ...

Thanks... American or Canadian ?  ;)

Oren.

> 
> Louis
> 
