From mboxrd@z Thu Jan 1 00:00:00 1970
From: Josh Durgin
Subject: Re: No lock on RBD allow several mount on different servers...
Date: Mon, 13 Aug 2012 10:22:41 -0700
Message-ID: <502937E1.6010000@inktank.com>
References: <06A44B2B-FA7F-418C-B9D2-D0602F4EB95B@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail-qc0-f174.google.com ([209.85.216.174]:40690 "EHLO
	mail-qc0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752434Ab2HMRXr (ORCPT ); Mon, 13 Aug 2012 13:23:47 -0400
Received: by qcro28 with SMTP id o28so2466265qcr.19 for ;
	Mon, 13 Aug 2012 10:23:47 -0700 (PDT)
In-Reply-To:
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Gregory Farnum
Cc: ceph-devel, Sebastien HAN, Marcus Sorensen

On 08/13/2012 09:55 AM, Gregory Farnum wrote:
> We've discussed some of the issues here a little bit before. See
> http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/7094 if
> you're interested.
>
> Josh, can you discuss the current status of the advisory locking?
> -Greg

Yehuda reworked it into a generic rados class so it can be used outside
of rbd. It hasn't been re-integrated with rbd yet, and I haven't looked
at it closely since the generalization. Yehuda could describe it in more
detail.

Josh

> On Sun, Aug 12, 2012 at 8:44 AM, Sage Weil wrote:
>> RBD image locking is on the roadmap, but it's tricky. Almost all of the
>> pieces are in place for exclusive locking of the image header, which will
>> let the user know when other nodes have the image mapped, and give them
>> the option to break their lock and take over ownership.
>>
>> The real challenge is fencing.
>> Unlike more conventional options like
>> SCSI, the RBD image is distributed across the entire cluster, so ensuring
>> that the old guy doesn't still have IOs in flight that will stomp on the
>> new owner means that potentially everyone needs to be informed that the
>> bad guy should be locked out.
>>
>> I think there are a few options:
>>
>> 1- The user has their own fencing or STONITH on top of rbd, informed by
>>    the rbd locking. Pull the plug, update your iptables, whatever. Not
>>    very friendly.
>> 2- Extend the rados 'blacklist' functionality to let you ensure that every
>>    node in the cluster has received the updated osdmap+blacklist
>>    information, so that you can be sure no further IO from the old guy is
>>    possible.
>> 3- Use the same approach that ceph-mds fencing uses, in which the old
>>    owner isn't known to be fenced away from a particular object until the
>>    new owner reads/touches that object.
>>
>> My hope is that we can get away with #3, in which case all of the basic
>> pieces are in place and the real remaining work is integration and
>> testing. The logic goes something like this:
>>
>> File systems write to blocks on disk in a somewhat ordered fashion.
>> After writing a bunch of data, they approach a 'consistency point' where
>> their journal and/or superblocks must be flushed and things 'commit' to
>> disk. At that point, if the IO fails or blocks, it won't continue to
>> clobber other parts of the disk.
>>
>> When an fs is mounted, those same critical areas are read (superblock,
>> journal, etc.). The existing client/osd interaction ensures that if
>> the new guy knows that the old guy is fenced, the act of reading
>> ensures that the relevant ceph-osds will find out too and that
>> particular object will be fenced.
>>
>> The resulting conclusion is that if a file system (or application on top
>> of it doing direct io) is sufficiently well-behaved that it will not
>> corrupt itself when the disk reorders IOs (they do) and issues
>> barrier/flush operations at the appropriate time (in modern kernels, they
>> do), then it will work.
>>
>> I suppose it's roughly analogous to Schroedinger's cat: until the new
>> owner reads a block, it may or may not still be modified/modifiable by the
>> old guy, but as soon as it is observed, its state is known.
>>
>> What do you guys think? If that doesn't work, I think we're stuck with
>> #2, which is expensive but doable.
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
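
For anyone trying to picture option #3 from Sage's mail, here is a rough
toy model of the "fenced only once the new owner touches the object" idea.
It is only a sketch under assumptions: OsdMap, ToyOsd, Fenced and run_demo
are hypothetical names made up for illustration, not Ceph or librados APIs,
and the real osdmap/blacklist handling is considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OsdMap:
    """Toy cluster map (hypothetical): an epoch plus blacklisted client ids."""
    epoch: int
    blacklist: frozenset = frozenset()


class Fenced(Exception):
    """Raised when a blacklisted client attempts IO."""


@dataclass
class ToyOsd:
    """Toy OSD (hypothetical): holds its own view of the map and some objects."""
    osdmap: OsdMap
    objects: dict = field(default_factory=dict)

    def _check(self, client_id, client_map):
        # Assumption of this model: a request carrying a newer map makes
        # the OSD adopt that map before it handles the request.
        if client_map.epoch > self.osdmap.epoch:
            self.osdmap = client_map
        if client_id in self.osdmap.blacklist:
            raise Fenced(client_id)

    def write(self, client_id, client_map, name, data):
        self._check(client_id, client_map)
        self.objects[name] = data

    def read(self, client_id, client_map, name):
        self._check(client_id, client_map)
        return self.objects.get(name)


def run_demo():
    stale = OsdMap(epoch=1)
    current = OsdMap(epoch=2, blacklist=frozenset({"old"}))
    osd_a = ToyOsd(stale)  # holds the superblock
    osd_b = ToyOsd(stale)  # holds an ordinary data block

    # Neither OSD has heard of the blacklist yet, so the old owner can write.
    osd_a.write("old", stale, "superblock", "v1")

    # The new owner mounts and reads the superblock; osd_a learns epoch 2.
    osd_a.read("new", current, "superblock")

    try:
        osd_a.write("old", stale, "superblock", "v2")
        old_fenced_on_a = False
    except Fenced:
        old_fenced_on_a = True

    # osd_b has not been touched by the new owner, so it is still unfenced.
    osd_b.write("old", stale, "block42", "stale data")
    return old_fenced_on_a, osd_a, osd_b
```

In this model the old owner is locked out of the superblock as soon as the
new owner has read it, while an untouched object on another OSD can still
be overwritten. That is the "may or may not still be modifiable until
observed" state described in the quoted mail, and it is why the approach
leans on the file system reading its critical areas at mount time.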