Re: No lock on RBD allow several mount on different servers...

From: Josh Durgin <josh.durgin@inktank.com>
To: Gregory Farnum <greg@inktank.com>
Cc: ceph-devel <ceph-devel@vger.kernel.org>,
	Sebastien HAN <han.sebastien@gmail.com>,
	Marcus Sorensen <shadowsor@gmail.com>
Subject: Re: No lock on RBD allow several mount on different servers...
Date: Mon, 13 Aug 2012 10:22:41 -0700
Message-ID: <502937E1.6010000@inktank.com>
In-Reply-To: <CAPYLRzhXmwt2AzvmUPZQWeHn75ViFTuCVq5EFwmiG+wV8cjKDQ@mail.gmail.com>

On 08/13/2012 09:55 AM, Gregory Farnum wrote:
> We've discussed some of the issues here a little bit before. See
> http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/7094 if
> you're interested.
>
> Josh, can you discuss the current status of the advisory locking?
> -Greg

Yehuda reworked it into a generic rados class so it can be used outside
of rbd. It hasn't been re-integrated with rbd yet, and I haven't looked
at it closely since the generalization. Yehuda could describe it in
more detail.

Josh

> On Sun, Aug 12, 2012 at 8:44 AM, Sage Weil <sage@inktank.com> wrote:
>> RBD image locking is on the roadmap, but it's tricky.  Almost all of the
>> pieces are in place for exclusive locking of the image header, which will
>> let the user know when other nodes have the image mapped, and give them
>> the option to break their lock and take over ownership.
>>
>> The real challenge is fencing.  Unlike more conventional options like
>> SCSI, the RBD image is distributed across the entire cluster, so ensuring
>> that the old guy doesn't still have IOs in flight that will stomp on the
>> new owner means that potentially everyone needs to be informed that the
>> bad guy should be locked out.
>>
>> I think there are a few options:
>>
>> 1- The user has their own fencing or STONITH on top of rbd, informed by
>>     the rbd locking.  Pull the plug, update your iptables, whatever.  Not
>>     very friendly.
>> 2- Extend the rados 'blacklist' functionality to let you ensure that every
>>     node in the cluster has received the updated osdmap+blacklist
>>     information, so that you can be sure no further IO from the old guy is
>>     possible.
>> 3- Use the same approach that ceph-mds fencing uses, in which the old
>>     owner isn't known to be fenced away from a particular object until the
>>     new owner reads/touches that object.
>>
>> My hope is that we can get away with #3, in which case all of the basic
>> pieces are in place and the real remaining work is integration and
>> testing.  The logic goes something like this:
>>
>> File systems write to blocks on disk in a somewhat ordered fashion.
>> After writing a bunch of data, they approach a 'consistency point' where
>> their journal and/or superblocks must be flushed and things 'commit' to
>> disk.  At that point, if the IO fails or blocks, it won't continue to
>> clobber other parts of the disk.
>>
>> When an fs is mounted, those same critical areas are read (superblock,
>> journal, etc.).  The existing client/osd interaction ensures that if
>> the new guy knows that the old guy is fenced, the act of reading
>> ensures that the relevant ceph-osds will find out too and that
>> particular object will be fenced.
>>
>> The resulting conclusion is that if a file system (or application on top
>> of it doing direct io) is sufficiently well-behaved that it will not
>> corrupt itself when the disk reorders IOs (they do) and issues
>> barrier/flush operations at the appropriate time (in modern kernels, they
>> do), then it will work.
>>
>> I suppose it's roughly analogous to Schroedinger's cat: until the new
>> owner reads a block, it may or may not still be modified/modifiable by the
>> old guy, but as soon as it is observed, its state is known.
>>
>> What do you guys think?  If that doesn't work, I think we're stuck with
>> #2, which is expensive but doable.
>>
>> sage
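
To make option #3 from the quoted message concrete, here is a minimal
Python sketch of lazy fencing. It is a toy model, not Ceph code: the
names Monitor, Osd, ClusterMap and handle() are hypothetical, and the
assumption that an OSD catches up to a newer map whenever an incoming
op carries a newer epoch is a simplification of the client/osd
interaction described above. Each Osd here stands in for whichever
ceph-osd holds a given object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterMap:
    """One version of the cluster map: an epoch plus the blacklisted clients."""
    epoch: int
    blacklist: frozenset = frozenset()


class Monitor:
    """Hypothetical stand-in for the monitors; list index equals epoch."""

    def __init__(self):
        self.maps = [ClusterMap(epoch=0)]

    def blacklist(self, client_id):
        # Publish a new map version that adds client_id to the blacklist.
        cur = self.maps[-1]
        new = ClusterMap(cur.epoch + 1, cur.blacklist | {client_id})
        self.maps.append(new)
        return new


class Osd:
    """Stores objects and only learns about a new map when an op mentions it."""

    def __init__(self, monitor):
        self.monitor = monitor
        self.map = monitor.maps[0]
        self.objects = {}

    def handle(self, client_id, client_epoch, name, data=None):
        # An op stamped with a newer epoch makes this OSD catch up first.
        if client_epoch > self.map.epoch:
            self.map = self.monitor.maps[client_epoch]
        # Blacklisted clients are rejected from this point on.
        if client_id in self.map.blacklist:
            raise PermissionError(f"{client_id} is fenced at epoch {self.map.epoch}")
        if data is not None:
            self.objects[name] = data  # write
        return self.objects.get(name)  # read (or echo of the write)


def demo():
    mon = Monitor()
    osd_a, osd_b = Osd(mon), Osd(mon)

    # The old owner writes while it still holds the image.
    osd_a.handle("old", 0, "superblock", "v1")
    osd_b.handle("old", 0, "data.7", "v1")

    # The new owner breaks the lock; only the monitor knows so far.
    fence = mon.blacklist("old")

    # Mounting reads the superblock, which carries the new epoch to osd_a.
    osd_a.handle("new", fence.epoch, "superblock")

    try:
        osd_a.handle("old", 0, "superblock", "stale")
        fenced_on_a = False
    except PermissionError:
        fenced_on_a = True

    # osd_b has not been touched by the new owner, so it still accepts the write.
    osd_b.handle("old", 0, "data.7", "stale")
    return fenced_on_a, osd_a.objects["superblock"], osd_b.objects["data.7"]


if __name__ == "__main__":
    print(demo())
```

In this model, demo() returns (True, 'v1', 'stale'): the superblock is
protected as soon as the new owner reads it, while an object the new
owner has not yet observed can still be modified by the old one. That
is the window the Schroedinger's cat analogy describes, and it is why
the approach depends on the file system reading its critical areas
before trusting anything else.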
