From mboxrd@z Thu Jan 1 00:00:00 1970 From: Sam Lang Subject: Re: Hadoop and Ceph client/mds view of modification time Date: Tue, 27 Nov 2012 11:33:10 -0600 Message-ID: <50B4F956.5000909@inktank.com> References: <50B4EE31.5020908@inktank.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from mail-ia0-f174.google.com ([209.85.210.174]:64740 "EHLO mail-ia0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932117Ab2K0RdO (ORCPT ); Tue, 27 Nov 2012 12:33:14 -0500 Received: by mail-ia0-f174.google.com with SMTP id y25so9140665iay.19 for ; Tue, 27 Nov 2012 09:33:14 -0800 (PST) In-Reply-To: Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Sage Weil Cc: Noah Watkins , ceph-devel , Gregory Farnum On 11/27/2012 11:03 AM, Sage Weil wrote: > On Tue, 27 Nov 2012, Sam Lang wrote: >> Hi Noah, >> >> I was able to reproduce your issue with a similar test using the fuse client >> and the clock_offset option for the mds. This is what I see happening: >> >> clientA's clock is a few seconds behind the mds clock >> >> clientA creates the file >> - the mds sets the mtime from its current time >> - clientA acquires the exclusive capability (cap) for the file >> >> clientA writes to the file >> - the mtime is updated locally (at clientA with its current time) >> >> clientA closes the file >> - the exclusive cap is flushed to the mds, but the mtime is less >> than the create mtime because of the clock skew, so the mds >> doesn't update it to the mtime from clientA's write >> >> clientA stats the file >> - the mtime from the write (still cached) gets returned. I saw a >> race in my tests, where sometimes the mtime was from the cache >> (if the flush hadn't completed I assume), and sometimes it was >> from the mds. >> >> clientB stats the file >> - the exclusive cap is revoked at clientA, but the mtime returned >> to clientB is from the mds >> >> The goal of the current implementation is to provide an mtime that is >> non-decreasing, but that conflicts with using mtime as a version in this case. >> Using mtime as a version has its own set of problems, but I won't go into that >> here. I think there are a few alternatives if we want to try to have a more >> consistent mtime value across clients. >> >> 1. Let the client set the create mtime. This avoids the issue that the mds >> and client clocks are out of sync, but in other cases where the client has a >> clock a few seconds ahead of other clients, we run into a similar problem. >> This might be reasonable considering clients that share state will more likely >> have synchronized clocks than the clients and mds. > > I like this option the best. It will clearly break when client clocks are > out of sync and multiple clients write to the file, but I think that is > the price you pay for client-driven writeback. > > Noah, is that sufficient to resolve the hadoop race? Is there a single > client writer? If we're looking to just resolve the hadoop case, there's an even less intrusive option: do a fsync before the stat to flush (and wait for) the caps. The reason we can't just close is that the close flushes the caps, but doesn't wait for the ack. I haven't tested this, but in my tests the race between stat and close indicates it should work. -sam > >> 2. Provide a config option to always set the mtime on cap flush/revoke, even >> if its less than the current mtime. This breaks the non-decreasing behavior, >> and requires the user set a config option across the cluster if they want >> this. > > We could also do this... but if the above is sufficient for hadoop I'd > rather not. :/ > >> 3. When a client acquires the cap for a file, have the mds provide its current >> time as well. As the client updates the mtime, it uses the timestamp provided >> by the mds and the time since the cap was acquired. >> Except for the skew caused by the message latency, this approach allows the >> mtime to be based off the mds time, so it will be consistent across clients >> and the mds. It does however, allow a client to set an mtime to the future >> (based off of its local time), which might be undesirable, but that is more >> like how NFS behaves. Message latency probably won't be much of an issue >> either, as the granularity of mtime is a second. Also, the client can set its >> cap acquired timestamp to the time at which the cap was requested, ensuring >> that the relative increment includes the round trip latency so that the mtime >> will always be set further ahead. Of course, this approach would be a lot more >> intrusive to implement. :-) > > Yeah, I'm less excited about this one. > > I think that giving consistent behavior from a single client despite clock > skew is a good goal. That will make things like pjd's test behave > consistently, for example. > > sage > >> >> -sam >> >> >> On 11/20/2012 01:44 PM, Noah Watkins wrote: >>> This is a description of the clock synchronization issue we are facing >>> in Hadoop: >>> >>> Components of Hadoop use mtime as a versioning mechanism. Here is an >>> example where Client B tests the expected 'version' of a file created >>> by Client A: >>> >>> Client A: create file, write data into file. >>> Client A: expected_mtime <-- lstat(file) >>> Client A: broadcast expected_mtime to client B >>> ... >>> Client B: mtime <-- lstat(file) >>> Client B: test expected_mtime == mtime >>> >>> Since mtime may be set in Ceph by both client and MDS, inconsistent >>> mtime view is possible when clocks are not adequately synchronized. >>> >>> Here is a test that reproduces the problem. In the following output, >>> issdm-18 has the MDS, and issdm-22 is a non-Ceph node with its time >>> set to an hour earlier than the MDS node. >>> >>> nwatkins@issdm-22:~$ ssh issdm-18 date && ./test >>> Tue Nov 20 11:40:28 PST 2012 // MDS TIME >>> local time: Tue Nov 20 10:42:47 2012 // Client TIME >>> fstat time: Tue Nov 20 11:40:28 2012 // mtime seen after file >>> creation (MDS time) >>> lstat time: Tue Nov 20 10:42:47 2012 // mtime seen after file write >>> (client time) >>> >>> Here is the code used to produce that output. >>> >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> #include >>> >>> int main(int argc, char **argv) >>> { >>> struct stat st; >>> struct ceph_mount_info *cmount; >>> struct timeval tv; >>> >>> /* setup */ >>> ceph_create(&cmount, "admin"); >>> ceph_conf_read_file(cmount, "/users/nwatkins/Projects/ceph.conf"); >>> ceph_mount(cmount, "/"); >>> >>> /* print local time for reference */ >>> gettimeofday(&tv, NULL); >>> printf("local time: %s", ctime(&tv.tv_sec)); >>> >>> /* create a file */ >>> char buf[256]; >>> sprintf(buf, "/somefile.%d", getpid()); >>> int fd = ceph_open(cmount, buf, O_WRONLY|O_CREAT, 0); >>> assert(fd > 0); >>> >>> /* get mtime for this new file */ >>> memset(&st, 0, sizeof(st)); >>> int ret = ceph_fstat(cmount, fd, &st); >>> assert(ret == 0); >>> printf("fstat time: %s", ctime(&st.st_mtime)); >>> >>> /* write some data into the file */ >>> ret = ceph_write(cmount, fd, buf, sizeof(buf), -1); >>> assert(ret == sizeof(buf)); >>> ceph_close(cmount, fd); >>> >>> memset(&st, 0, sizeof(st)); >>> ret = ceph_lstat(cmount, buf, &st); >>> assert(ret == 0); >>> printf("lstat time: %s", ctime(&st.st_mtime)); >>> >>> ceph_shutdown(cmount); >>> return 0; >>> } >>> >>> Note that this output is currently using the short patch from >>> http://marc.info/?l=ceph-devel&m=133178637520337&w=2 which forces >>> getattr to always go to the MDS. >>> >>> diff --git a/src/client/Client.cc b/src/client/Client.cc >>> index 4a9ae3c..2bb24b7 100644 >>> --- a/src/client/Client.cc >>> +++ b/src/client/Client.cc >>> @@ -3858,7 +3858,7 @@ int Client::readlink(const char *relpath, char >>> *buf, loff_t \ >>> size) >>> int Client::_getattr(Inode *in, int mask, int uid, int gid) >>> { >>> - bool yes = in->caps_issued_mask(mask); >>> + bool yes = false; //in->caps_issued_mask(mask); >>> >>> ldout(cct, 10) << "_getattr mask " << ccap_string(mask) << " >>> issued=" << yes << \ >>> dendl; if (yes) >>> -- >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >> >>