From: Joao Eduardo Luis <joao.luis@inktank.com>
To: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Cc: florian@hastexo.com, Greg Farnum <greg@inktank.com>
Subject: Comments on Ceph.com's blog article 'Ceph's New Monitor Changes'
Date: Mon, 11 Mar 2013 12:04:03 +0000
Message-ID: <513DC833.9090806@inktank.com>

Last Friday, Florian Haas (CC'ed) commented on Google+ [2] about the
Monitor changes blog post [1]. This is a transcription of the resulting
thread, in which Greg (CC'ed) also participated. I am cross-posting it to
the list for the benefit of the larger ceph-devel community, who might not
otherwise stumble upon the post (although it is public and should not
require a G+ account).
[1] - http://ceph.com/dev-notes/cephs-new-monitor-changes/
[2] - https://plus.google.com/110443614427234590648/posts/iuxSyCC5aJp
-Joao
> Florian Haas on Mar 8, 2013 wrote:
>
> Good to see more Ceph developers providing their insight on recent
> and ongoing codebase changes.
>
> +Joao Eduardo Luis, I have a comment about the transition from the
> file-backed mon store to a leveldb k/v store. As the original
> reporter of http://tracker.ceph.com/issues/2752, I'm always
> wondering how to best recover if any issue simultaneously bringing
> down all mons were ever to happen again. At the time, +Gregory
> Farnum's suggestion for recovery
> (http://marc.info/?l=ceph-devel&m=134151077312444&w=2) involved
> manipulating files in the mon data directory manually; I wonder if
> the leveldb approach makes this easier or harder.
>
> It tends to be a common source of discomfort among potential Ceph
> users that if their mons ever become unrecoverable, it's almost
> impossible to recover your data (compare to GlusterFS, where you can
> always pull data out of Gluster bricks unharmed, at least as long as
> you don't use striping volumes). With a file backed mon store, I had
> hoped that eventually this might tie into btrfs snapshots such that
> you would have been able to roll back to a known good configuration
> in an emergency. With the switch to leveldb, I no longer foresee that
> ever happening. Mind sharing your thoughts on that?

> Gregory Farnum on Mar 8, 2013 wrote:
>
> Actually LevelDB snapshots just fine.
>
> I'll leave the rest for Joao, or me after I've been awake for more
> than a minute. :)

> Florian Haas on Mar 8, 2013 wrote:
>
> Thanks. So leveldb updates are always atomic, consistent, isolated
> and durable?
>
> Also, do the mons open the leveldb fd with O_SYNC, or do they
> periodically do fsync() or fdatasync()?

> Gregory Farnum on Mar 8, 2013 wrote:
>
> I'm not entirely sure -- leveldb handles all that stuff on its own
> and provides the right guarantees at the interface level, so I assume it's
> doing those things on its own. ;)
> (In particular it's actually a hierarchy, not a single file.)

> Joao Eduardo Luis on Mar 9, 2013 wrote:
>
> When it comes to manipulating the contents of the store, yeah, it's
> harder now. You'll need a tool that speaks leveldb for that. Which IMO
> can actually be nice if we have a tool that allows one to perform minor
> incursions in the store in a somewhat automated fashion (say, revert to
> an older osdmap) -- and creating a tool to change the store's contents
> isn't hard to do either, we just didn't get around to putting it together
> (there's one however to read the store's contents, adapting it should be
> easy).
>
> And I can see the discomfort in having the mon store on leveldb instead
> of the FS. I had never thought about backing up the mon store using,
> say, btrfs snapshots, as I've been approaching it from a 'distributed
> is the path to success' angle and keeping multiple mons around. So I guess
> that it also might make things harder, if btrfs snapshots (or any other
> snapshot tools for that matter) don't play nice with leveldb (or
> vice-versa). I looked on the internets for a potential solution, and it
> looks like someone at stackoverflow [1] recommended performing some
> copies and creating some hard links of some of leveldb's contents to
> achieve that -- it's not a pretty solution and would require the monitor
> to block accesses to the store while performing such an operation.
>
> Also, it is worth mentioning that leveldb does support snapshots, but
> they will all be lost after leveldb is closed, so there's no gain from
> such support when attempting to create checkpoints (unless there's
> something I've missed altogether and this is in fact possible!).
> Furthermore, operations are applied in batches (we have a nifty
> interface that abstracts them as transactions, but they're not really
> ACID transactions, although they benefit from Atomicity, Durability and
> some form of Consistency and limited Isolation [2]), and we force them
> to be written synchronously to leveldb. This basically means that if the
> whole batch successfully reaches the disk, everything should be okay; if
> the system crashes sometime in the middle of it, leveldb will
> automatically ignore partial writes, and for a batch of operations that
> will mean the whole batch.
>
> By the way, I believe this discussion should be moved to ceph-devel, as
> it can be beneficial to other members of the community. If no one
> objects, I will do so later today.
>
> [1] - http://goo.gl/7HSCH
> [2] - Atomicity is internally guaranteed by leveldb, resorting to their
> own magic; Consistency as well, guaranteed by ignoring partial writes if
> a system should fail, but leveldb mostly leaves it up to the user wrt
> how things are flushed to disk; Isolation is fairly limited: you can
> only have one process accessing leveldb at a time, but multiple
> threads can do as they please, and although leveldb will take care of
> most of the required synchronization, some thought should still go into
> that; and Durability is said to be configurable, but I have no idea what
> that means -- I have yet to find a way to make it non-durable, not that I
> really need it, but it would be nice to know. :-)
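
The batch behaviour described in the thread (a batch either reaches the
disk whole or is ignored on recovery) can be illustrated with a toy
append-only log. This is only a sketch under stated assumptions: it is not
leveldb's on-disk format and not the monitor's code. The record framing,
the function names (encode_batch, append_batch, recover, demo) and the
osdmap-style keys are all hypothetical. In leveldb's own C++ API the
closest equivalent is a WriteBatch passed to DB::Write with
WriteOptions::sync set to true.

```python
import os
import struct
import tempfile
import zlib

# Hypothetical framing: payload length followed by CRC32 of the payload.
HEADER = struct.Struct("<II")


def encode_batch(ops):
    """Serialize a list of (key, value) puts into one framed record."""
    payload = b"".join(
        struct.pack("<II", len(k), len(v)) + k + v for k, v in ops
    )
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def append_batch(path, ops):
    """Append one batch and force it to disk before returning."""
    with open(path, "ab") as f:
        f.write(encode_batch(ops))
        f.flush()
        os.fsync(f.fileno())  # synchronous write: durable once we return


def recover(path):
    """Replay complete batches; stop at the first torn or corrupt record."""
    store = {}
    with open(path, "rb") as f:
        data = f.read()
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # partial write: the whole batch is dropped
        off = 0
        while off < length:
            klen, vlen = struct.unpack_from("<II", payload, off)
            off += 8
            key = payload[off:off + klen]
            off += klen
            store[key] = payload[off:off + vlen]
            off += vlen
        pos = start + length
    return store


def demo():
    """Write two full batches, then a truncated third, and recover."""
    with tempfile.TemporaryDirectory() as d:
        log = os.path.join(d, "toy.log")
        append_batch(log, [(b"osdmap:1", b"a"), (b"osdmap:last", b"1")])
        append_batch(log, [(b"osdmap:2", b"b"), (b"osdmap:last", b"2")])
        torn = encode_batch([(b"osdmap:3", b"c"), (b"osdmap:last", b"3")])
        with open(log, "ab") as f:
            f.write(torn[:-5])  # simulated crash in the middle of a batch
        return recover(log)
```

Calling demo() returns the state left by the first two batches only. The
third batch is discarded as a unit, so "osdmap:last" never points at an
"osdmap:3" entry that was not fully written.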