From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matt Helsley
Subject: Re: [RFC][PATCH 2/3][cr][v2]: Checkpoint/restart file leases
Date: Fri, 30 Jul 2010 18:36:53 -0700
Message-ID: <20100731013653.GC2927@count0.beaverton.ibm.com>
References: <1274836063-13271-1-git-send-email-sukadev@linux.vnet.ibm.com>
 <1274836063-13271-3-git-send-email-sukadev@linux.vnet.ibm.com>
 <4C170430.2030708@cs.columbia.edu>
 <20100730191607.GA16238@us.ibm.com>
 <4C532BC9.6050109@cs.columbia.edu>
 <1280525871.2451.23.camel@localhost.localdomain>
 <4C5352EF.9080601@cs.columbia.edu>
 <20100731003504.GA3544@us.ibm.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Oren Laadan, john stultz, "Serge E. Hallyn", Matt Helsley,
 matthew@wil.cx, linux-fsdevel@vger.kernel.org, Containers
To: Sukadev Bhattiprolu
Return-path:
Received: from e6.ny.us.ibm.com ([32.97.182.146]:33643 "EHLO e6.ny.us.ibm.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754360Ab0GaBg5
 (ORCPT ); Fri, 30 Jul 2010 21:36:57 -0400
Received: from d01relay07.pok.ibm.com (d01relay07.pok.ibm.com [9.56.227.147])
 by e6.ny.us.ibm.com (8.14.4/8.13.1) with ESMTP id o6V1aCCA016933
 for ; Fri, 30 Jul 2010 21:36:12 -0400
Received: from d01av03.pok.ibm.com (d01av03.pok.ibm.com [9.56.224.217])
 by d01relay07.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id o6V1atU12396404
 for ; Fri, 30 Jul 2010 21:36:55 -0400
Received: from d01av03.pok.ibm.com (loopback [127.0.0.1])
 by d01av03.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id o6V1asaX032603
 for ; Fri, 30 Jul 2010 22:36:55 -0300
Content-Disposition: inline
In-Reply-To: <20100731003504.GA3544@us.ibm.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Fri, Jul 30, 2010 at 05:35:04PM -0700, Sukadev Bhattiprolu wrote:
> Oren Laadan [orenl@cs.columbia.edu] wrote:
> >
> >
> > john stultz wrote:
> >> On Fri, 2010-07-30 at 15:45 -0400, Oren Laadan wrote:
> >>> Sukadev Bhattiprolu wrote:
> >>>> Oren Laadan [orenl@cs.columbia.edu] wrote:
> >>>> |
> >>>> | >  	h->fl_type = lock->fl_type;
> >>>> | > +	h->fl_type_prev = lock->fl_type_prev;
> >>>> | >  	h->fl_flags = lock->fl_flags;
> >>>> | > +	if (h->fl_type & F_INPROGRESS &&
> >>>> | > +			(lock->fl_break_time > jiffies))
> >>>> | > +		h->fl_rem_lease = (lock->fl_break_time - jiffies) / HZ;
> >>>> |
> >>>> | Hmmm -- we have a ctx->ktime_begin marking the start of the checkpoint.
> >>>> | It is used for relative-time calculations, for example, the expiry of
> >>>> | restart-blocks and timeouts. I suggest that we use it here too to be
> >>>> | consistent.
> >>>>
> >>>> Well, I started off using ktime_begin but discussed this with John Stultz
> >>>> (CC'd here) who pointed out that mixing different domains of time is likely
> >>>> to cause errors. ktime is an absolute time and fl_break_time uses jiffies -
> >>>> a relative time.
> >>>>
> >>>> I think use of ktime_begin for restart_blocks is fine (since they use
> >>>> ktime_t) but using ktime_t for file leases and converting between jiffies
> >>>> and nanoseconds could be a problem, unless we convert fl_break_time to
> >>>> seconds.
> >>>>
> >>>> IOW, can we leave the above computation of ->fl_rem_lease for now ?
> >>> The data on restart_blocks keep relative time - it's the time
> >>> to expiry relative to ktime_begin (which is absolute).
> >>>
> >>> ktime_begin merely gives a reference point in time against which
> >>> all other time-related values should be saved. The advantage is
> >>> that all time computation are relative to the same point in time
> >>> at checkpoint/restart - the time when the ktime_begin is set. It's
> >>> more consistent this way.
> >>
> >> First, forgive me for not being very aware of the checkpoint/restart
> >> code.
> >
> > On the contrary, forgive me if I'm stating the obvious below ...
> >
> >>
> >> So, ktime_begin is an absolute CLOCK_MONOTONIC time, relative to the
> >> time the system booted (more or less). And it represents the checkpoint
> >> time, correct?
> >
> > As a rule, all time measurements in the checkpoint image are
> > saved as relative values, using the start-of-checkpoint as the
> > reference point in time (*).
> >
> > So at checkpoint, every absolute time value should be converted
> > to a value relative to the start-of-checkpoint. At restart, every
> > relative time value from the image is converted back (if needed)
> > to an absolute time value using a respective start-of-restart.
>
> One general observation, slightly off-topic. You mention
> "start-of-restart" here and ...
> >
> > This makes sense for the most common case, where if a process
> > had 5 more seconds to sleep at the time of checkpoint, we would
> > like it to have 5 more seconds to sleep after it restarts.
>
> ... "after it restarts" here. These two can be quite different if,
> as you mention below, the C/R is a lengthy operation.

Worse! I think the big concern is not the duration of checkpoint but the
amount of time userspace expects to store a checkpoint before restarting.
It's an arbitrary amount of time. A user could restart 50 days later. Or
100. etc. Sure, it's _unlikely_, but we (checkpoint/restart implementers)
shouldn't count on seeing only "reasonable" times between completion of
checkpoint and initiation of restart. We need to be robust for all the
times we see.

I think that's why Oren made the choices he did. Making times relative to
ktime_begin certainly helps keep the time values semi-sane for those crazy
arbitrary cases in addition to being nice for the "reasonable" cases.

Cheers,
	-Matt
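
For readers following the thread, the save-relative/restore-absolute rule discussed above can be sketched in stand-alone userspace C. This is only an illustration, not code from the patch set: the function names, the plain `unsigned long` tick arguments standing in for the kernel's `jiffies`, and the `HZ` value of 250 are all assumptions. The sketch also uses a wrap-safe signed-difference test (in the spirit of the kernel's `time_after()`), whereas the quoted hunk compares `fl_break_time > jiffies` directly.

```c
#include <assert.h>

/* Assumed tick rate, for this sketch only; the real HZ is a kernel config value. */
#define HZ 250UL

/*
 * Checkpoint side (hypothetical helper): turn an absolute lease break time,
 * expressed in ticks, into whole seconds remaining relative to "now".
 * The signed difference keeps the test correct across tick-counter wraparound.
 */
unsigned long remaining_lease_secs(unsigned long break_time, unsigned long now)
{
	if ((long)(break_time - now) <= 0)
		return 0;	/* lease break time already reached */
	return (break_time - now) / HZ;
}

/*
 * Restart side (hypothetical helper): rebuild an absolute break time from
 * the saved relative value, using the tick counter of the restarting system.
 */
unsigned long restored_break_time(unsigned long rem_secs, unsigned long now)
{
	return now + rem_secs * HZ;
}

/* Round trip: 5 seconds left at checkpoint should be 5 seconds left at restart. */
void lease_selfcheck(void)
{
	unsigned long ckpt_now = 1000000UL;		/* ticks at checkpoint */
	unsigned long break_time = ckpt_now + 5 * HZ;	/* lease breaks in 5 s */
	unsigned long rem = remaining_lease_secs(break_time, ckpt_now);

	unsigned long restart_now = 42UL;		/* unrelated tick count, e.g. after a reboot */
	unsigned long new_break = restored_break_time(rem, restart_now);

	assert(rem == 5);
	assert(remaining_lease_secs(new_break, restart_now) == 5);
}
```

Because only the relative value is stored, the result does not depend on how long the image sat on disk or on the tick counter of the machine doing the restart, which is the robustness property argued for above. Rounding down to whole seconds loses sub-second precision, as it does in the quoted `/ HZ` computation.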