From: John Spray <john.spray@redhat.com>
To: Gregory Farnum <greg@gregs42.com>, Josh Durgin <jdurgin@redhat.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: RBD mirroring design draft
Date: Thu, 28 May 2015 11:42:07 +0100
Message-ID: <5566F0FF.2020400@redhat.com>
In-Reply-To: <CAC6JEv_gGMnNUVFr5m2PaTum9Pa2z2YadVSJA+JBE7BvqOMKuA@mail.gmail.com>

On 28/05/2015 06:37, Gregory Farnum wrote:
> On Tue, May 12, 2015 at 5:42 PM, Josh Durgin <jdurgin@redhat.com> wrote:
>> It will need some metadata regarding positions in the journal. These
>> could be stored as omap values in a 'journal header' object in a
>> replicated pool, for rbd perhaps the same pool as the image for
>> simplicity. The header would contain at least:
>>
>> * pool_id - where journal data is stored
>> * journal_object_prefix - unique prefix for journal data objects
>> * positions - (zone, purpose, object num, offset) tuples indexed by zone
>> * object_size - approximate size of each data object
>> * object_num_begin - current earliest object in the log
>> * object_num_end - max potential object in the log
>>
>> Similar to rbd images, journal data would be stored in objects named
>> after the journal_object_prefix and their object number. To avoid
>> issues of padding or splitting journal entries, and to make it simpler
>> to keep append-only, it's easier to allow the objects to be near
>> object_size before moving to the next object number instead of
>> sticking with an exact object size.
>>
>> Ideally this underlying structure could be used for both rbd and
>> cephfs. Variable sized objects are different from the existing cephfs
>> journal, which uses fixed-size objects for striping. The default is
>> still 4MB chunks though. How important is striping the journal to
>> cephfs? For rbd it seems unlikely to help much, since updates need to
>> be batched up by the client cache anyway.
> I think the journaling v2 stuff that John did actually made objects
> variably-sized as you've described here. We've never done any sort of
> striping on the MDS journal, although I think it was possible
> previously.
The objects are still fixed size: we talked about changing it so that
journal events would never span an object boundary, but didn't do it --
it still uses Filer.
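
To make the difference concrete, here is a minimal sketch of the two layouts. It is not Filer's code or any Ceph API: the function names are hypothetical, and the rollover rule in variable_layout_append (start a new object only once the current one has reached object_size) is just one assumed reading of "near object_size".

```python
# Hypothetical illustration only; none of these names exist in Ceph.
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB, the default chunk size mentioned above


def fixed_layout_extents(offset, length, object_size=OBJECT_SIZE):
    """Split a journal byte range into (object_num, object_offset, length)
    extents under a fixed-size layout. An event that crosses an object
    boundary produces more than one extent."""
    extents = []
    while length > 0:
        object_num, object_offset = divmod(offset, object_size)
        chunk = min(length, object_size - object_offset)
        extents.append((object_num, object_offset, chunk))
        offset += chunk
        length -= chunk
    return extents


def variable_layout_append(object_num, object_fill, length,
                           object_size=OBJECT_SIZE):
    """Place an entry whole (never split). Assumed policy: move to the next
    object number only once the current object has reached object_size, so
    an object may end up slightly larger than object_size.
    Returns (object_num, object_offset, new_fill)."""
    if object_fill >= object_size:
        object_num, object_fill = object_num + 1, 0
    return object_num, object_fill, object_fill + length
```

With the fixed layout, a 30-byte event starting 10 bytes before a boundary comes back as two extents in objects 0 and 1. With the variable layout the same event lands entirely in object 0, and only the next append moves to object 1.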
>
>>
>> Parallelism
>> ^^^^^^^^^^^
>>
>> Mirroring many images is embarrassingly parallel. A simple unit of
>> work is an image (more specifically a journal, if e.g. a group of
>> images shared a journal as part of a consistency group in the future).
>>
>> Spreading this work across threads within a single process is
>> relatively simple. For HA, and to avoid a single NIC becoming a
>> bottleneck, we'll want to spread out the work across multiple
>> processes (and probably multiple hosts). rbd-mirror should have no
>> local state, so we just need a mechanism to coordinate the division of
>> work across multiple processes.
>>
>> One way to do this would be layering on top of watch/notify. Each
>> rbd-mirror process in a zone could watch the same object, and shard
>> the set of images to mirror based on a hash of image ids onto the
>> current set of rbd-mirror processes sorted by client gid. The set of
>> rbd-mirror processes could be determined by listing watchers.
> You're going to have some tricky cases here when reassigning authority
> as watchers come and go, but I think it should be doable.
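
The sharding scheme quoted above, and the reassignment that happens when the watcher set changes, can be sketched roughly as follows. This is an illustration under assumptions, not rbd-mirror code: the choice of SHA-256, the modulo mapping, and all function names are hypothetical, and the watcher gids are plain integers rather than the result of a real "list watchers" call.

```python
import hashlib


def owner_of(image_id, watcher_gids):
    """Pick the process responsible for an image: hash the image id onto
    the list of watchers sorted by client gid."""
    ordered = sorted(watcher_gids)
    if not ordered:
        raise ValueError("no rbd-mirror processes are watching")
    digest = hashlib.sha256(image_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(ordered)
    return ordered[index]


def assignment(image_ids, watcher_gids):
    """Map every image id to its owning watcher gid."""
    return {image_id: owner_of(image_id, watcher_gids)
            for image_id in image_ids}


def moved_images(image_ids, old_gids, new_gids):
    """Image ids whose owner differs between two watcher sets. These are
    the images that need an authority handover."""
    before = assignment(image_ids, old_gids)
    after = assignment(image_ids, new_gids)
    return sorted(i for i in image_ids if before[i] != after[i])
```

Every process computes the same answer from the same watcher list, so no extra coordination is needed in the steady state. With a plain modulo mapping, though, a single watcher joining or leaving can move images between processes that did not change, which is one source of the tricky handover cases.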
I've been fantasizing about something similar to this for CephFS
backward scrub/recovery. My current code supports parallelism, but
relies on the user to script their population of workers across client
nodes.
I had been thinking of more of a master/slaves model, where one guy
would get to be the master by e.g. taking the lock on an object, and he
would then hand out work to everyone else that was a watch/notify
subscriber to the magic object. It seems like that could be simpler
than having each worker work out independently what its workload
should be, and it would have the added bonus of providing a command-like
mechanism in addition to continuous operation.
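
A rough sketch of that master/worker idea, using an in-memory stand-in for the magic object. Nothing here is a RADOS call: MagicObject, try_lock, and hand_out_work are hypothetical names, round-robin is an assumed distribution policy, and the fallback of the master taking the work when it is the only watcher is an assumption as well.

```python
class MagicObject:
    """In-memory stand-in for the shared object: one exclusive lock plus a
    list of watch/notify subscribers. Hypothetical, not a RADOS API."""

    def __init__(self):
        self.lock_holder = None
        self.watchers = []

    def watch(self, worker_id):
        """Register a worker as a subscriber."""
        self.watchers.append(worker_id)

    def try_lock(self, worker_id):
        """Exclusive lock: the first caller wins and becomes the master.
        Returns True only for the lock holder."""
        if self.lock_holder is None:
            self.lock_holder = worker_id
        return self.lock_holder == worker_id


def hand_out_work(obj, master_id, work_items):
    """Master-side step: deal work items round-robin to every other
    watcher. Returns {worker_id: [items]}, standing in for the payloads
    the master would send via notify."""
    if obj.lock_holder != master_id:
        raise RuntimeError("only the lock holder may hand out work")
    # Assumed fallback: a master with no other watchers does the work itself.
    workers = [w for w in sorted(obj.watchers) if w != master_id] or [master_id]
    plan = {w: [] for w in workers}
    for i, item in enumerate(work_items):
        plan[workers[i % len(workers)]].append(item)
    return plan
```

Because only the lock holder computes the plan, workers never need to agree on the watcher list among themselves. The same notify channel could carry one-off commands as well as the continuous work assignments.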
Cheers,
John