Re: Hadoop and Ceph client/mds view of modification time

From: Sam Lang <sam.lang@inktank.com>
To: David Zafman <david.zafman@inktank.com>
Cc: Sage Weil <sage@inktank.com>, Noah Watkins <jayhawk@cs.ucsc.edu>,
	ceph-devel <ceph-devel@vger.kernel.org>,
	Gregory Farnum <greg@inktank.com>
Subject: Re: Hadoop and Ceph client/mds view of modification time
Date: Tue, 27 Nov 2012 15:14:07 -0600
Message-ID: <50B52D1F.3090003@inktank.com>
In-Reply-To: <D5518516-618D-4C5A-A704-436CBC5502F4@inktank.com>

On 11/27/2012 01:38 PM, David Zafman wrote:
>
> On Nov 27, 2012, at 11:05 AM, Sam Lang <sam.lang@inktank.com> wrote:
>
>> On 11/27/2012 12:01 PM, Sage Weil wrote:
>>> On Tue, 27 Nov 2012, David Zafman wrote:
>>>>
>>>> On Nov 27, 2012, at 9:03 AM, Sage Weil <sage@inktank.com> wrote:
>>>>
>>>>> On Tue, 27 Nov 2012, Sam Lang wrote:
>>>>>
>>>>>> 3. When a client acquires the cap for a file, have the mds provide its current
>>>>>> time as well.  As the client updates the mtime, it uses the timestamp provided
>>>>>> by the mds and the time since the cap was acquired.
>>>>>> Except for the skew caused by the message latency, this approach allows the
>>>>>> mtime to be based off the mds time, so it will be consistent across clients
>>>>>> and the mds.  It does however, allow a client to set an mtime to the future
>>>>>> (based off of its local time), which might be undesirable, but that is more
>>>>>> like how  NFS behaves.  Message latency probably won't be much of an issue
>>>>>> either, as the granularity of mtime is a second. Also, the client can set its
>>>>>> cap acquired timestamp to the time at which the cap was requested, ensuring
>>>>>> that the relative increment includes the round trip latency so that the mtime
>>>>>> will always be set further ahead. Of course, this approach would be a lot more
>>>>>> intrusive to implement. :-)
>>>>>
>>>>> Yeah, I'm less excited about this one.
>>>>>
>>>>> I think that giving consistent behavior from a single client despite clock
>>>>> skew is a good goal.  That will make things like pjd's test behave
>>>>> consistently, for example.
>>>>>
>>>>
>>>> My suggestion is that a client writing to a file will try to use its
>>>> local clock unless it would cause the mtime to go backward.  In that
>>>> case it will simply perform the minimum mtime advance possible (1
>>>> second?).  This handles the case in which one client created a file
>>>> using his clock (per previous suggested change), then another client
>>>> writes with a clock that is behind.
>>
>> We can choose to not decrement at the client, but because mtime is a time_t (seconds since epoch), we can't increment by 1 for each write. 1000 writes each taking 0.01s would move the mtime 990 seconds into the future.
>
> The mtime update shouldn't work that way (see below).
>
>>
>>>
>>> That's a possibility (if it's 1ms or 1ns, at least :). We need to verify
>>> what POSIX says about that, though: if you utimes(2) an mtime into the
>>> future, what happens on write(2)?
>
> On ext4, a write(2) after the mtime has been set into the future with utimes(2) does make the time go backward.  However, we can notice that if ctime == mtime then the last operation on the file was a create/write/truncate.  This means that we should not let the mtime go backward in that case.  If ctime != mtime, then the mtime has been set by utimes(2), so we can set mtime using our clock even if it goes backward.

I'm not sure I follow you here.  utimes(2) can leave mtime and ctime 
the same or different, and can set mtime and/or ctime to the current 
time.  That makes it hard to rely on the mtime != ctime conditional.

We might be able to use the time_warp_seq field, similar to how it's 
used for client and mds mtime updates.  If the time_warp_seq has been 
incremented since we acquired caps, we skip the mtime increment and 
just use the current time.

I've pushed wip-mtime-incr with the mtime increment check.  There's also 
a separate commit that updates the ctime on write.
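
For illustration only (a minimal sketch, not the code in wip-mtime-incr), 
the time_warp_seq check described above could look roughly like the 
following.  The names InodeState, next_mtime and MTIME_STEP are 
hypothetical, and the 1-second step assumes time_t granularity.

```python
from dataclasses import dataclass

# Assumed minimum mtime advance in seconds (time_t granularity).
MTIME_STEP = 1


@dataclass
class InodeState:
    """Hypothetical client-side view of an inode's time fields."""
    mtime: int           # seconds since epoch
    time_warp_seq: int   # assumed to change when mtime is set explicitly


def next_mtime(inode: InodeState, warp_seq_at_cap: int, local_now: int) -> int:
    """Pick the mtime a writing client would record.

    warp_seq_at_cap is the time_warp_seq value the client saw when it
    acquired its caps; local_now is the client's own clock.
    """
    # mtime was set explicitly since we got caps: skip the increment
    # and just use the local clock, even if that moves mtime backward.
    if inode.time_warp_seq != warp_seq_at_cap:
        return local_now
    # Local clock does not move mtime backward: use it as-is.
    if local_now >= inode.mtime:
        return local_now
    # Local clock is behind: advance by the minimum step instead.
    return inode.mtime + MTIME_STEP
```

With an mtime of 100 and a lagging local clock reading 90, this yields 
101 while time_warp_seq is unchanged, and 90 once time_warp_seq has 
moved on.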

-sam

>
>>
>> According to http://pubs.opengroup.org/onlinepubs/009695399/, writes only require an update to mtime; it doesn't specify what the update should be:
>>
>> "Upon successful completion, where nbyte is greater than 0, write() shall mark for update the st_ctime and st_mtime fields of the file, and if the file is a regular file, the S_ISUID and S_ISGID bits of the file mode may be cleared."
>
> What this really means is that all writes mark mtime for update but do not set a specific time in the inode yet.  All writes/truncates will be rolled into a single mtime bump.  So even if we only have 1 second granularity (but hopefully it is 1 ms or 1 us), only when a stat occurs (or in our case when sending info to the MDS or returning capabilities) does a new mtime need to be set, and it will be at most 1 second ahead.
>
>>
>> In NFS, the server sets the mtime.  It's relatively common to see "Warning: file 'foo' has modification time in the future" if you're compiling on NFS and your client and NFS server clocks are skewed.  So allowing the mtime to be set in the near future would at least follow the principle of least surprise for most folks.
>
> So Ceph can see this warning too if different skewed clocks are setting mtime and it appears in the future to some clients.
>
>>
>> -sam
>>
>>>
>>> sage
>>>
>>
>
