From: Sam Lang
Subject: Re: Hadoop and Ceph client/mds view of modification time
Date: Tue, 27 Nov 2012 11:12:12 -0600
Message-ID: <50B4F46C.7090504@inktank.com>
References: <50B4EE31.5020908@inktank.com> <8707A447F7754068A1D89D240265E1B7@inktank.com>
In-Reply-To: <8707A447F7754068A1D89D240265E1B7@inktank.com>
To: Gregory Farnum
Cc: Noah Watkins, ceph-devel, Sage Weil

On 11/27/2012 11:07 AM, Gregory Farnum wrote:
> On Tuesday, November 27, 2012 at 8:45 AM, Sam Lang wrote:
>>
>> Hi Noah,
>>
>> I was able to reproduce your issue with a similar test using the fuse
>> client and the clock_offset option for the mds. This is what I see
>> happening:
>>
>> clientA's clock is a few seconds behind the mds clock
>>
>> clientA creates the file
>> - the mds sets the mtime from its current time
>> - clientA acquires the exclusive capability (cap) for the file
>>
>> clientA writes to the file
>> - the mtime is updated locally (at clientA with its current time)
>>
>> clientA closes the file
>> - the exclusive cap is flushed to the mds, but the mtime is less
>>   than the create mtime because of the clock skew, so the mds
>>   doesn't update it to the mtime from clientA's write
>>
>> clientA stats the file
>> - the mtime from the write (still cached) gets returned. I saw a
>>   race in my tests, where sometimes the mtime was from the cache
>>   (if the flush hadn't completed I assume), and sometimes it was
>>   from the mds.
>>
>> clientB stats the file
>> - the exclusive cap is revoked at clientA, but the mtime returned
>>   to clientB is from the mds
>
> Hurray, I think we all agree about what's happening now! :)
>
> Have you checked to see if the MDS ever sets mtime after create, or is
> it always dictated by the client following that?

It sets it on truncate as well.
-sam

>
>>
>> The goal of the current implementation is to provide an mtime that is
>> non-decreasing, but that conflicts with using mtime as a version in this
>> case. Using mtime as a version has its own set of problems, but I won't
>> go into that here. I think there are a few alternatives if we want to
>> try to have a more consistent mtime value across clients.
>>
>> 1. Let the client set the create mtime. This avoids the issue that the
>> mds and client clocks are out of sync, but in other cases where the
>> client has a clock a few seconds ahead of other clients, we run into a
>> similar problem. This might be reasonable considering clients that
>> share state will more likely have synchronized clocks than the clients
>> and mds.
>>
>> 2. Provide a config option to always set the mtime on cap flush/revoke,
>> even if it's less than the current mtime. This breaks the non-decreasing
>> behavior, and requires the user to set a config option across the cluster
>> if they want this.
>>
>> 3. When a client acquires the cap for a file, have the mds provide its
>> current time as well. As the client updates the mtime, it uses the
>> timestamp provided by the mds and the time since the cap was acquired.
>> Except for the skew caused by the message latency, this approach allows
>> the mtime to be based off the mds time, so it will be consistent across
>> clients and the mds. It does, however, allow a client to set an mtime in
>> the future (based off of its local time), which might be undesirable,
>> but that is more like how NFS behaves.
>> Message latency probably won't be much of an issue either, as the
>> granularity of mtime is a second. Also, the client can set its
>> cap-acquired timestamp to the time at which the cap was requested,
>> ensuring that the relative increment includes the round-trip latency,
>> so that the mtime will always be set further ahead.
>> Of course, this approach would be a lot more intrusive to implement. :-)
>
> I actually like this third approach of letting the MDS be authoritative
> about time even if it's not directly involved. Given that, I wonder if
> perhaps the client should just have time translation functions it uses
> everywhere?
> However, the problem with that is that the different MDS daemons might
> also disagree about time. Perhaps they could adopt the master MDS clock
> or something skanky like that… :/
>
> The fundamental issue of time resolution is why things are the way they
> are now, of course; you usually don't want things going "backwards" in
> time, but clock skews are a real problem in large clusters, so we just
> decided to not let things go back, on the assumption it would be a
> temporary and little-noticed disparity, with easy-to-understand
> behavior. Obviously we were wrong about it being little-noticed.
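
The scenario walked through in this message, and the third alternative, can be sketched as a small toy model in Python. This is not Ceph code: the `Mds` and `CapClock` classes, their methods, and the concrete timestamps (a 5 second skew, a write 2 seconds after create) are all hypothetical and chosen only for illustration. The sketch assumes the mds keeps the larger of its stored mtime and the flushed one, and that under option 3 the client adds its locally measured elapsed time to the mds timestamp it received with the cap.

```python
from dataclasses import dataclass


@dataclass
class Mds:
    """Toy metadata server (hypothetical) that never lets mtime decrease."""
    mtime: float = 0.0

    def create(self, mds_now: float) -> None:
        # On create, the mds stamps the file from its own clock.
        self.mtime = mds_now

    def flush(self, client_mtime: float) -> None:
        # On cap flush, a smaller client mtime is ignored (non-decreasing rule).
        self.mtime = max(self.mtime, client_mtime)


@dataclass
class CapClock:
    """Option 3 (hypothetical): translate local time into mds time."""
    mds_time_at_grant: float      # mds clock value sent along with the cap
    local_time_at_request: float  # client clock when the cap was requested

    def now(self, local_now: float) -> float:
        # mds timestamp plus the time elapsed locally since the cap request.
        return self.mds_time_at_grant + (local_now - self.local_time_at_request)


def demo():
    skew = -5.0                           # clientA's clock is 5 s behind the mds
    mds_create = 1000.0                   # mds clock at create
    client_request = mds_create + skew    # clientA's clock at the same instant
    client_write = client_request + 2.0   # clientA writes 2 s later (local clock)

    # Current behaviour: the flushed mtime is older than the create mtime.
    current = Mds()
    current.create(mds_create)
    current.flush(client_write)

    # Option 3: clientA derives the mtime from the mds timestamp.
    proposed = Mds()
    proposed.create(mds_create)
    clock = CapClock(mds_create, client_request)
    proposed.flush(clock.now(client_write))

    # (mtime seen by clientB today, mtime cached at clientA, mtime under option 3)
    return current.mtime, client_write, proposed.mtime


if __name__ == "__main__":
    print(demo())
```

In this model clientA's cached mtime and the mds mtime disagree under the current rule, while under option 3 the flushed value advances past the create mtime regardless of the skew. The sketch ignores message latency and the case of multiple mds daemons with differing clocks, both of which are raised in the thread.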