From: Oren Laadan <orenl@cs.columbia.edu>
To: Gene Cooperman <gene@ccs.neu.edu>
Cc: "Luck, Tony" <tony.luck@intel.com>,
	Kapil Arya <kapil@ccs.neu.edu>,
	"ksummit-2010-discuss@lists.linux-foundation.org" 
	<ksummit-2010-discuss@lists.linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [Ksummit-2010-discuss] checkpoint-restart: naked patch
Date: Sat, 06 Nov 2010 17:00:09 -0400
Message-ID: <4CD5C1D9.7050509@cs.columbia.edu>
In-Reply-To: <20101105171703.GA1760@sundance.ccs.neu.edu>



On 11/05/2010 01:17 PM, Gene Cooperman wrote:
> On Fri, Nov 05, 2010 at 04:57:33AM -0700, Luck, Tony wrote:
>>> Oren noted that sometimes it's important to stop the process only
>>> for a few milliseconds while one checkpoints. In DMTCP, we do that
>>> by configuring with --enable-forked-checkpointing. This causes us
>>> to fork a child process taking advantage of copy-on-write and then
>>> checkpoint the memory pages of the child while the parent continues
>>> to execute.
>>
>> Interesting ... but while the process is only stopped for the duration
>> of the fork, it may be taking COW faults on almost every page it
>> touches.  I think this will not work well for large HPC applications
>> that allocate most of physical memory as anonymous pages for the
>> application. It may even result in an OOM kill if you don't complete
>> the checkpoint of the child and have it exit in a timely manner.
>>
>> -Tony
>>
> 
> I agree with you that forked checkpointing is probably not what you
> want in the middle of an HPC computation.  But isn't that part of
> the nature of COW?  Whether the COW is invoked within the kernel,
> or from outside the kernel via fork --- in either case, when you have
> mostly dirty pages, you will have to copy most of the pages.
> Do I understand your point correctly?			Thanks,
> 							- Gene

COW is one way of reducing downtime (whether through fork or
in-kernel checkpoint). However, it is possible to avoid using
it (and thus avoid the extra page faults and memory pressure) by
using the page-table "dirty" bit to track modified pages. This way
one can "pre-copy" the checkpoint image while the application is
running, without additional overhead (the idea is similar to how
live migration is done).

Oren.
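
The pre-copy idea described above can be illustrated with a small
user-space simulation. This is only a sketch under stated assumptions,
not the in-kernel implementation discussed in this thread: memory is
modeled as a Python list of page values, the "dirty bits" are a plain
set of page indices, and the names precopy_checkpoint, make_app,
max_rounds and threshold are hypothetical. Each round copies the pages
dirtied since the previous round while the application keeps writing;
the application is stopped only for the final, small set of pages.

```python
import random


def make_app(seed, hot_pages):
    """Return a toy 'application' step that writes to one hot page per call."""
    rng = random.Random(seed)

    def step(memory):
        page = rng.choice(hot_pages)
        memory[page] += 1          # the write marks the page dirty
        return {page}              # report which pages were dirtied

    return step


def precopy_checkpoint(memory, run_app_step, max_rounds=5, threshold=4):
    """Copy pages while the application runs, then stop-and-copy the rest.

    Returns (image, rounds, pages_copied_while_stopped).
    """
    image = [None] * len(memory)
    dirty = set(range(len(memory)))    # first round: every page is "dirty"
    rounds = 0
    while len(dirty) > threshold and rounds < max_rounds:
        to_copy = dirty
        dirty = set()                  # clear the dirty bits before copying
        for page in to_copy:
            image[page] = memory[page]
            # The application runs concurrently and may re-dirty pages.
            dirty |= run_app_step(memory)
        rounds += 1
    # Final phase: the application is stopped, so no new writes occur.
    stopped_pages = len(dirty)
    for page in dirty:
        image[page] = memory[page]
    return image, rounds, stopped_pages


memory = [0] * 64
app = make_app(seed=1, hot_pages=list(range(8)))
image, rounds, stopped_pages = precopy_checkpoint(memory, app)
print(rounds, stopped_pages, image == memory)
```

In this model the stop-time cost is bounded by the application's set
of frequently written pages rather than by its total memory size,
which is the property that makes pre-copy attractive compared with
copying everything while stopped.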
