Re: RBD mirroring design draft

From: Josh Durgin <jdurgin@redhat.com>
To: Haomai Wang <haomaiwang@gmail.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: RBD mirroring design draft
Date: Wed, 13 May 2015 21:21:48 -0700
Message-ID: <555422DC.6050400@redhat.com>
In-Reply-To: <CACJqLyZA2+eXd=cqDwYOyqeE+O+kovHOF=HavGp=SWkrJ-LgVw@mail.gmail.com>

On 05/13/2015 01:07 AM, Haomai Wang wrote:
> On Wed, May 13, 2015 at 8:42 AM, Josh Durgin <jdurgin@redhat.com> wrote:
>> Some other possible optimizations:
>> * reading a large window of the journal to coalesce overlapping writes
>> * decoupling reading from the leader zone and writing to follower zones,
>> to allow optimizations like compression of the journal or other
>> transforms as data is sent, and relaxing the requirement for one node
>> to be directly connected to more than one ceph cluster
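
As a rough illustration of the first optimization above (coalescing
overlapping writes from a window of the journal), here is a minimal
sketch. The (offset, data) entry shape and the coalesce() helper are
hypothetical and not part of the draft or of librbd; the only assumption
taken from replay semantics is that later journal entries win where
writes overlap.

```python
from typing import List, Tuple

# Hypothetical journal entry shape: (byte offset in the image, payload).
Write = Tuple[int, bytes]


def coalesce(window: List[Write]) -> List[Write]:
    """Collapse an ordered window of writes into non-overlapping extents.

    Entries are applied in journal order, so later data overwrites
    earlier data wherever two writes overlap.
    """
    extents: List[Tuple[int, bytearray]] = []  # sorted, disjoint, non-adjacent
    for offset, data in window:
        if not data:
            continue
        start, end = offset, offset + len(data)
        merged = bytearray(data)
        kept = []
        for e_off, e_buf in extents:
            e_end = e_off + len(e_buf)
            if e_end < start or e_off > end:
                kept.append((e_off, e_buf))  # untouched by this write
                continue
            # Overlapping or adjacent: keep only the old bytes that lie
            # outside the new write, and grow the merged extent.
            if e_off < start:
                merged = e_buf[: start - e_off] + merged
                start = e_off
            if e_end > end:
                merged = merged + e_buf[end - e_off:]
                end = e_end
        kept.append((start, merged))
        kept.sort(key=lambda e: e[0])
        extents = kept
    return [(off, bytes(buf)) for off, buf in extents]


if __name__ == "__main__":
    window = [(0, b"aaaa"), (2, b"bbbb"), (10, b"cc"), (3, b"Z")]
    print(coalesce(window))
```

With this, four journal entries become two writes to the follower zone.
A real implementation would also have to bound the window size and
respect flush/snapshot ordering points, which this sketch ignores.
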
>
> Maybe we could add separate NIC/network support that is used only to
> write journaling data to the journaling pool? To my mind, a multi-site
> cluster always needs another low-latency fiber link.

Yeah, this seems desirable. It seems like it'd be possible based on the
way the NICs and routing tables are set up, without needing any special
configuration from ceph, or am I missing something?

>> Failover
>> --------
>>
>> Watch/notify could also be used (via a predetermined object) to
>> communicate with rbd-mirror processes to get sync status from each,
>> and for managing failover.
>>
>> Failing over means preventing changes in the original leader zone, and
>> making the new leader zone writeable. The state of a zone (read-only vs
>> writeable) could be stored in a zone's metadata in rados to represent
>> this, and images with the journal feature bit could check this before
>> being opened read/write for safety. To make it race-proof, the zone
>> state can be a tri-state - read-only, read-write, or changing.
>>
>> In the original leader zone, if it is still running, the zone would be
>> set to read-only mode and all clients could be blacklisted to avoid
>> creating too much divergent history to roll back later.
>>
>> In the new leader zone, the zone's state would be set to 'changing',
>> and rbd-mirror processes would be told to stop copying from the
>> original leader and close the images they were mirroring to.  New
>> rbd-mirror processes should refuse to start mirroring when the zone is
>> not read-only. Once the mirroring processes have stopped, the zone
>> could be set to read-write, and begin normal usage.
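
To make the tri-state transition above concrete, here is a minimal
in-memory sketch of the new leader zone's side of failover. The Zone
class and its method names are hypothetical; in the draft the state
would live in the zone's metadata in rados and would need to be updated
atomically, which a plain Python object does not model.

```python
from enum import Enum


class ZoneState(Enum):
    READ_ONLY = "read-only"
    CHANGING = "changing"
    READ_WRITE = "read-write"


class Zone:
    """In-memory stand-in for per-zone metadata (hypothetical)."""

    def __init__(self, name, state=ZoneState.READ_ONLY):
        self.name = name
        self.state = state
        self.active_mirrors = 0

    def start_mirror(self):
        # New rbd-mirror processes refuse to start unless the zone is read-only.
        if self.state is not ZoneState.READ_ONLY:
            raise RuntimeError("zone %s is %s" % (self.name, self.state.value))
        self.active_mirrors += 1

    def stop_mirror(self):
        if self.active_mirrors == 0:
            raise RuntimeError("no mirror process is running")
        self.active_mirrors -= 1

    def can_open_read_write(self):
        # Images with the journal feature bit would check this before opening.
        return self.state is ZoneState.READ_WRITE

    def begin_promotion(self):
        # read-only -> changing: existing mirrors are asked to stop.
        if self.state is not ZoneState.READ_ONLY:
            raise RuntimeError("can only promote a read-only zone")
        self.state = ZoneState.CHANGING

    def finish_promotion(self):
        # changing -> read-write, only once all mirroring has stopped.
        if self.state is not ZoneState.CHANGING:
            raise RuntimeError("promotion was not started")
        if self.active_mirrors:
            raise RuntimeError("mirror processes are still running")
        self.state = ZoneState.READ_WRITE


if __name__ == "__main__":
    zone_b = Zone("B")
    zone_b.start_mirror()
    zone_b.begin_promotion()
    zone_b.stop_mirror()
    zone_b.finish_promotion()
    print(zone_b.state.value, zone_b.can_open_read_write())
```

The intermediate 'changing' state is what closes the race: while it is
set, neither a new mirror process nor a read/write image open is allowed
to proceed.
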
>>
>> Failback
>> ^^^^^^^^
>>
>> In this scenario, after failing over, the original leader zone (A)
>> starts running again, but needs to catch up to the current leader
>> (B). At a high level, this involves syncing up the image by rolling
>> back the updates in A past the point B synced to as noted in an
>> image's journal in A, and mirroring all the changes since then from
>> B.
>>
>> This would need to be an offline operation, since at some point
>> B would need to go read-only before A goes read-write. Making this
>> transition online is outside the scope of mirroring for now, since it
>> would require another level of indirection for rbd users like QEMU.
>
> So do you mean that when the primary zone fails, we need to switch the
> primary zone offline by hand?

I think we'd want to have some higher-level script controlling it, with
a pluggable trigger that could be based on user-defined monitoring.
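
Purely as a sketch of what "pluggable" could mean here: each trigger is
a callable that reports whether the leader zone looks failed, and the
controlling script fails over once enough of them agree. All names
below (Trigger, read_timeout_trigger, should_fail_over) are hypothetical
and nothing like this exists in ceph today.

```python
import time
from typing import Callable, Iterable

# A trigger returns True when it believes the leader zone has failed.
Trigger = Callable[[], bool]


def read_timeout_trigger(last_successful_read: Callable[[], float],
                         timeout_s: float,
                         now: Callable[[], float] = time.monotonic) -> Trigger:
    """Fire when no read from the leader zone has succeeded for timeout_s.

    last_successful_read is assumed to return a timestamp on the same
    clock as now.
    """
    def fired() -> bool:
        return now() - last_successful_read() > timeout_s
    return fired


def should_fail_over(triggers: Iterable[Trigger], required: int = 1) -> bool:
    """Return True when at least `required` triggers have fired."""
    return sum(1 for trigger in triggers if trigger()) >= required
```

User-defined monitoring would plug in as just another callable in the
list, and `required` lets a site demand agreement from more than one
signal before failing over.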

This is something I'm less sure of, though; it'd be good to get more
feedback on what users are interested in here. Would ceph detecting
failure based on e.g. rbd-mirror timing out reads from the leader zone
be good enough for most users?
